Zigang Geng$^{1,3}$, Chunyu Wang$^{3*}$, Yixuan Wei$^{2,3}$, Ze Liu$^{1,3}$, Houqiang Li$^{1}$, Han Hu$^{3*}$
$^{1}$University of Science and Technology of China $^{2}$Tsinghua University $^{3}$Microsoft Research Asia
https://sites.google.com/view/pctpose
Abstract
Human pose is typically represented by a coordinate vector of body joints or their heatmap embeddings. While convenient for data processing, these representations admit unrealistic pose estimates because they do not model the dependency between body joints. In this paper, we present a structured representation, named Pose as Compositional Tokens (PCT), to explore the joint dependency. It represents a pose by $M$ discrete tokens, each characterizing a sub-structure with several interdependent joints (see Figure 1). The compositional design enables it to achieve a small reconstruction error at a low cost. We then cast pose estimation as a classification task: we learn a classifier to predict the categories of the $M$ tokens from an image, and a pre-learned decoder network recovers the pose from the tokens without further post-processing. We show that it achieves better or comparable pose estimation results than existing methods in general scenarios, yet continues to work well when occlusion occurs, which is ubiquitous in practice. The code and models are publicly available at https://github.com/Gengzigang/PCT.
1. Introduction
Human pose estimation is a fundamental task in computer vision which aims to estimate the positions of body joints from images. Recent progress has focused on network structures [74, 87, 96], training methods [31, 68, 93], and fusion strategies [14, 15, 61, 67, 84, 102], which have notably advanced the accuracy on public datasets. However, it remains an open problem in challenging scenarios, e.g., in the presence of occlusion, which hinders its application in practice.
Figure 1. Our approach represents a pose by $M$ discrete tokens, which are indices to the codebook entries (top). Each token is learned to represent a sub-structure. In each row, we show that changing the state of one token to different values consistently changes the same sub-structure, highlighted in orange; the black poses are before the change (bottom).

Current 2D/3D pose estimators usually represent a pose by a coordinate vector [23, 34, 79, 110] or its heatmap embeddings [40, 55, 60, 74, 75, 80, 87, 90]. In both representations, the joints are treated independently, ignoring the fact that the body joints can serve as mutual context to each other. As a result, they may produce unrealistic estimates when occlusion occurs, as shown in Figure 2 (top). It is interesting to note, however, that humans can easily predict intact poses from only the visible joints and the visual features. This is probably because people are able to use context to aid recognition, as evidenced by psychology experiments [5, 58]. Some works attempt to introduce a tree or graph structure [2, 21, 65, 85] to model joint dependency. However, the hand-designed rules usually make unrealistic assumptions about the relationships, making them incapable of representing complex patterns.
Figure 2. Heatmap-based method (top) vs. our PCT method (bottom) in occluded scenes. PCT predicts reasonable poses even under severe occlusion. The images are from COCO val2017.

In this work, we hope to learn the dependency between the joints earlier, in the representation stage, without any assumptions. Our initial idea is to learn a set of prototype poses that are realistic and represent every pose by the nearest prototype. While this can guarantee that all poses are realistic, it requires a large number of prototypes to reduce the quantization error to a reasonable level, which is computationally infeasible. Instead, we propose a discrete representation, named pose as compositional tokens (PCT). Figure 3 shows the two stages of the representation. In Stage I, we learn a compositional encoder to transform a pose into $M$ token features, with each encoding a sub-structure of the pose; see Figure 1 for some examples. The tokens are then quantized by a shared codebook, so a pose is simply represented by $M$ discrete indices. The space represented by the codebook is sufficiently large to represent all poses accurately. We jointly learn the encoder, the codebook, and the decoder by minimizing a reconstruction error.
In Stage II, we cast human pose estimation as a classification task. Given an image, we predict the categories of the $M$ tokens, from which the pose is recovered by the decoder network. The PCT representation has several advantages. First, the dependency between the joints is modeled by the tokens, which helps to reduce the chance of getting unrealistic pose estimates. In particular, we see evidence that it has the potential to obtain reasonable estimates even when a large portion of the body is occluded; see Figure 2 (bottom) for some examples. Second, it does not require any expensive post-processing modules such as UDP [29], which the heatmap representation requires to reduce the quantization errors. Third, it provides a unified representation for 2D and 3D poses. In addition, the discrete representation potentially facilitates its interactions with other discrete modalities such as text and speech, although this is not the focus of this work.
We extensively evaluate our approach in 2D human pose estimation on five benchmark datasets. It achieves better or comparable accuracy to the state-of-the-art methods on all of them. More importantly, it achieves significantly better results when evaluated only on the occluded joints, validating the advantages of its dependency modeling capability. We also present 3D pose estimation results on the H36M dataset, on which it achieves comparable accuracy to the state-of-the-art methods using a simple architecture. The results demonstrate that it has wide applicability.
2. Related works
In this section, we first briefly discuss the widely used pose representations. Then we discuss the methods that explore joint dependencies.
2.1. Pose representations
Coordinates. Early works [4, 9, 50, 56, 76, 79, 110] propose to directly regress the coordinates of body joints from images. While efficient, their accuracy is worse than that of heatmap-based methods because the highly non-linear mapping is challenging to learn. Some works [23, 89] propose to improve them by focusing on local features around the joints. Residual Log-likelihood Estimation [34] proposes a novel regression paradigm to capture the underlying output distribution. MDN [82] introduces a mixture density network for regression. Recently, transformers [83] have brought notable improvements [36, 49, 71] due to their ability to capture long-range information.
Heatmaps. The heatmap representation [3, 8, 20, 38, 41, 48, 54, 62, 69, 88, 92, 98] has been dominant since its introduction [6, 78, 90] because of its strong localization and generalization ability. Many follow-up works have been devoted to continuously improving it, including proposing powerful networks [7, 12, 13, 27, 55, 74] to estimate the heatmaps more accurately, introducing the attention operator into the models [40, 72, 97, 103], reducing the quantization errors [29, 105], fusing with the coordinate prediction-based methods [19, 25, 60, 75], refining the results [22, 53, 73, 85], leveraging other tasks [33, 57, 59], and leveraging large unlabeled datasets [32, 93]. However, the heatmap representation suffers from quantization errors caused by the downsampling operations in neural networks. Besides, joint dependency is not modeled by the heatmaps.
Discrete bins. Recent works [10, 39, 47] propose to divide each pixel into several bins, allowing sub-pixel localization accuracy. The horizontal and vertical coordinates of each joint are separately quantized into discrete classes. Similar to our work, they also cast human pose estimation as a classification task. However, each coordinate of the pose is treated independently, which differs from our structured representation.
2.2. Modeling joint dependency
Since the human body has an articulated structure, many works try to model joint dependency to help resolve low-level ambiguities. However, most of them focus on the modeling aspect rather than the representation, which is the focus of this work.
Pictorial structures. Some works [2, 21, 63, 65, 100] propose to use deformable models where the relationship between body joints is explicitly considered based on anatomical priors (e.g., limb lengths). However, they have three disadvantages. First, they usually make strong assumptions about the relationships, e.g., a Gaussian distribution on the offsets between two joints, making them incapable of representing complex patterns. Second, they still require the body joints to be independently detected from images first, after which they use the dependency priors to obtain the most plausible configuration. However, the first step is already very difficult in cluttered scenes with serious occlusions. Finally, they cannot be trained end-to-end with deep networks, with the exception of [78], which needs to relax the formulation.
Implicit modeling. Recent deep learning-based methods [16, 64, 85, 99, 101, 106] implicitly model the dependency by propagating visual features between the joints. For example, Chu et al. [16] introduce geometrical transform kernels to fuse the features of different channels, which are believed to characterize different joints. Wang et al. [85] use a Graph Convolutional Network to refine pose estimates that are first obtained by heatmap-based methods. In addition, Chen et al. [11] propose to learn a pose discriminator to exclude non-realistic pose estimates and push the predictor to learn poses with reasonable structures. Li et al. [40] explicitly learn a type embedding for each joint and apply a transformer to model the relationships among the joints. But from the aspect of representation, they still treat each joint independently and predict a heatmap for each joint.
Our PCT representation differs from the previous methods in three aspects. First, the joint dependency is encoded earlier, in the representation itself, by the tokens: changing the state of a token changes the corresponding sub-structure rather than a single joint. In contrast, the other three representations treat each joint independently. Second, the sub-structures are automatically learned from training data without making any unrealistic assumptions. We empirically show that it has a stronger capability to resolve ambiguities caused by occlusion in a variety of situations. Third, the joint dependency is explicitly imposed rather than left to implicit feature propagation; the latter still allows unrealistic pose estimates in challenging cases.
3. Pose as Compositional Tokens
In Section 3.1, we describe how to learn the codebook and the encoder/decoder networks. Section 3.2 explains how it is used in the human pose estimation task.
3.1. Learning compositional tokens
We represent a raw pose as $\mathbf{G} \in \mathbb{R}^{K \times D}$, where $K$ is the number of body joints and $D$ is the dimension of each joint, with $D=2$ for 2D poses and $D=3$ for 3D poses, respectively. We learn a compositional encoder $f_{e}(\cdot)$ to transform a pose into $M$ token features:

$$\left(\mathbf{t}_{1}, \mathbf{t}_{2}, \cdots, \mathbf{t}_{M}\right)=f_{e}(\mathbf{G}),$$
where each token feature $\mathbf{t}_{i} \in \mathbb{R}^{H}$ approximately corresponds to a sub-structure of the pose which involves a few interdependent joints. Figure 1 shows some of the learned examples. Note that the representation has lots of redundancy because different tokens may have overlapping joints. The redundancy makes it robust to occlusions of individual parts.
Similar to [81], we define a latent embedding space by a codebook $\mathbf{C}=\left(\mathbf{c}_{1}, \cdots, \mathbf{c}_{V}\right)^{\mathrm{T}} \in \mathbb{R}^{V \times N}$, where $V$ is the number of codebook entries. We quantize each token $\mathbf{t}_{i}$ by nearest-neighbor look-up in the embedding space:

$$q\left(\mathbf{t}_{i}\right)=\underset{j}{\arg \min }\left\|\mathbf{t}_{i}-\mathbf{c}_{j}\right\|_{2}.$$
Note that all tokens share the same embedding space $\mathbf{C}$, which simplifies training.
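To make the lookup concrete, here is a minimal sketch of the shared-codebook quantization, assuming PyTorch; the tensor names and shapes are ours, not the released implementation:

```python
import torch

def quantize(tokens: torch.Tensor, codebook: torch.Tensor):
    """Nearest-neighbor quantization of token features.

    tokens:   (M, N) token features produced by the encoder.
    codebook: (V, N) shared codebook entries C.
    Returns the discrete indices q(t_i) and the quantized features c_{q(t_i)}.
    """
    dists = torch.cdist(tokens, codebook)  # (M, V) pairwise L2 distances
    indices = dists.argmin(dim=-1)         # (M,)  nearest entry per token
    return indices, codebook[indices]      # (M,), (M, N)
```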
We abuse $q\left(\mathbf{t}_{i}\right)$ to also represent the index of the corresponding codebook entry. The quantized tokens $\left(\mathbf{c}_{q\left(\mathbf{t}_{1}\right)}, \mathbf{c}_{q\left(\mathbf{t}_{2}\right)}, \cdots, \mathbf{c}_{q\left(\mathbf{t}_{M}\right)}\right)$ are then fed to the decoder network $f_{d}(\cdot)$ to recover the original pose:

$$\hat{\mathbf{G}}=f_{d}\left(\mathbf{c}_{q\left(\mathbf{t}_{1}\right)}, \mathbf{c}_{q\left(\mathbf{t}_{2}\right)}, \cdots, \mathbf{c}_{q\left(\mathbf{t}_{M}\right)}\right).$$
The network structure is similar to the encoder network in the reverse order, except that we use a shallower MLP-Mixer network with only one block.
The encoder network, the codebook, and the decoder network are jointly learned by minimizing the following loss over the training dataset:
$$\ell_{pct}=\operatorname{smooth}_{L_{1}}(\hat{\mathbf{G}}, \mathbf{G})+\beta \sum_{i=1}^{M}\left\|\mathbf{t}_{i}-\operatorname{sg}\left[\mathbf{c}_{q\left(\mathbf{t}_{i}\right)}\right]\right\|_{2}^{2},$$
where sg denotes the stop-gradient operation and $\beta$ is a hyperparameter.
We follow the optimization strategy used in [81] to handle the broken-gradients issue in the quantization step, and the codebook is updated using the exponential moving average of previous token features. In our implementation, two designs improve the results. First, inspired by [26, 94], we randomly mask some joints and require the model to reconstruct them. Second, we concatenate the image features around the joints with the positional features to enhance the discrimination ability.
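A hedged sketch of the Stage-I objective $\ell_{pct}$ with the straight-through trick that [81] uses for the broken gradients (our simplification, reusing `quantize` from the sketch above; the EMA codebook update happens outside this loss, the joint masking is omitted, and the `beta` value is a placeholder):

```python
import torch
import torch.nn.functional as F

def pct_loss(pose_gt, tokens, quantized, decoder, beta=0.25):
    # Straight-through estimator: the forward pass uses the quantized
    # entries; the backward pass routes gradients through the raw tokens,
    # bypassing the non-differentiable argmin.
    st_tokens = tokens + (quantized - tokens).detach()
    pose_rec = decoder(st_tokens)                    # \hat{G}
    rec = F.smooth_l1_loss(pose_rec, pose_gt)        # smooth-L1 term
    # Commitment term: ||t_i - sg[c_{q(t_i)}]||_2^2 summed over the M tokens.
    commit = ((tokens - quantized.detach()) ** 2).sum(dim=-1).sum()
    return rec + beta * commit
```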
Discussion. We try to explain why PCT learns tokens that correspond to meaningful sub-structures of poses. At one extreme, if each token corresponds to a single joint, then we need $w \times h$ (i.e., 65536 for an image of size $256 \times 256$) codebook entries to achieve a small quantization error. But we use only 1024 entries in our experiments, which is much smaller. This drives the model to learn larger structures than individual joints to improve the efficiency of the codebook. At the other extreme, if we let a token correspond to an intact pose, then we need only one token instead of $M$ tokens. But in the worst case, it requires $(wh)^{K}$ codebook entries to quantize the poses with a small error. In contrast, our method drives the model to divide a pose into multiple basic sub-structures whose possible configurations can be described by a shared set.
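To make the capacity argument concrete with the numbers quoted above (our restatement of the same arithmetic):

$$wh = 256 \times 256 = 65536 \quad \text{entries needed for per-joint tokens},$$
$$(wh)^{K} = 65536^{K} \quad \text{entries, in the worst case, for a single whole-pose token},$$
$$V^{M} \quad \text{composite configurations from only } V = 1024 \text{ shared entries with } M \text{ tokens}.$$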
Relation to VQ-VAE [81]. The PCT representation is inspired by VQ-VAE. The main difference is that VQ-VAE treats well-defined regular data, e.g., image patches with a resolution of $16 \times 16$, as tokens. For human poses, in contrast, we require PCT to automatically learn meaningful sub-structures as tokens, which is realized by the compositional encoder as well as the codebook sharing scheme. Besides, the network structures of the encoder and decoder are particularly designed for human poses, differing from VQ-VAE.
3.2. Human Pose Estimation
With the learned codebook and decoder, we cast human pose estimation as a classification task. As shown in Figure 3, given a cropped input image $\mathbf{I}$, we simply predict the categories of the $M$ tokens, which are fed to the decoder to recover the pose. We use a backbone to extract image features $\mathbf{X}$ and design the following classification head.
Classification head. We first use two basic residual convolution blocks [28] to modulate the backbone features. Then, we flatten the features and change their dimension by a linear projection layer:

$$\mathbf{x}_{f}=\mathcal{L}(\operatorname{flatten}(\mathcal{C}(\mathbf{X}))),$$
where $\mathcal{C}$ and $\mathcal{L}$ represent the feature modulator and the linear projection, respectively. We reshape the one-dimensional output feature $\mathbf{x}_{f}$ into a matrix $\mathbf{X}_{f} \in \mathbb{R}^{M \times N}$, use four MLP-Mixer blocks [77] to process the features, and output the logits of token classification $\hat{\mathbf{L}} \in \mathbb{R}^{M \times V}$.
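The head can be sketched as follows, assuming PyTorch; the channel widths, $M$, $N$, $V$, the feature map size, and the simplified mixer block are placeholders rather than the released architecture:

```python
import torch
import torch.nn as nn

class MixerBlock(nn.Module):
    """Simplified MLP-Mixer block [77]: token mixing, then channel mixing."""
    def __init__(self, num_tokens: int, dim: int):
        super().__init__()
        self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)
        self.token_mlp = nn.Sequential(
            nn.Linear(num_tokens, num_tokens), nn.GELU(),
            nn.Linear(num_tokens, num_tokens))
        self.chan_mlp = nn.Sequential(
            nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))

    def forward(self, x):  # x: (B, M, N)
        x = x + self.token_mlp(self.norm1(x).transpose(1, 2)).transpose(1, 2)
        return x + self.chan_mlp(self.norm2(x))

class ClassificationHead(nn.Module):
    def __init__(self, in_ch=768, feat_hw=64, M=34, N=256, V=1024):
        super().__init__()
        self.M, self.N = M, N
        # Feature modulator C: a stand-in for the two residual conv blocks.
        self.modulator = nn.Sequential(
            nn.Conv2d(in_ch, in_ch, 3, padding=1), nn.ReLU(),
            nn.Conv2d(in_ch, in_ch, 3, padding=1), nn.ReLU())
        # Linear projection L from the flattened features to M*N dimensions.
        self.proj = nn.Linear(in_ch * feat_hw, M * N)
        self.mixers = nn.Sequential(*[MixerBlock(M, N) for _ in range(4)])
        self.to_logits = nn.Linear(N, V)  # per-token logits over V classes

    def forward(self, feats):  # feats: (B, in_ch, H', W') backbone features X
        x = self.modulator(feats) + feats      # residual modulation
        x = self.proj(x.flatten(1))            # (B, M*N)
        x = x.view(-1, self.M, self.N)         # reshape into X_f
        return self.to_logits(self.mixers(x))  # \hat{L}: (B, M, V)
```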
Training. We use two losses to train the classification head. First, we enforce the cross-entropy loss:
$$\ell_{cls}=\operatorname{CE}(\hat{\mathbf{L}}, \mathbf{L}),$$
where $\mathbf{L}$ denotes the ground-truth token classes obtained by feeding the ground-truth poses into the encoder.
We also enforce a pose reconstruction loss, which minimizes the difference between the predicted and the ground-truth poses. To allow the gradients from the decoder network to flow back to the classification head, we replace the hard inference scheme with a soft version:

$$\mathbf{S}=\operatorname{softmax}(\hat{\mathbf{L}})\, \mathbf{C},$$
where $\mathbf{S} \in \mathbb{R}^{M \times N}$ denotes the linearly interpolated token features. The token features $\mathbf{S}$ are then fed to the pre-learned decoder to obtain the predicted pose $\hat{\mathbf{G}}$. The complete loss function is:

$$\ell=\ell_{cls}+\lambda\, \operatorname{smooth}_{L_{1}}(\hat{\mathbf{G}}, \mathbf{G}),$$

where $\lambda$ is a weighting hyperparameter.
Note that the decoder network is not updated during training.
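Putting the Stage-II objective together, a minimal sketch, assuming PyTorch; `lambda_rec` is a placeholder weight, and the decoder's parameters are assumed frozen via `requires_grad_(False)` so gradients flow through it to the head without updating it:

```python
import torch
import torch.nn.functional as F

def stage2_loss(logits, gt_classes, pose_gt, codebook, decoder, lambda_rec=1.0):
    """logits: (B, M, V) token logits \\hat{L}; gt_classes: (B, M) classes L;
    codebook: (V, N) frozen entries C; decoder: pre-learned and frozen."""
    cls = F.cross_entropy(logits.flatten(0, 1), gt_classes.flatten())
    # Soft inference: interpolate codebook entries with the predicted class
    # probabilities so the reconstruction term is differentiable w.r.t. logits.
    soft_tokens = logits.softmax(dim=-1) @ codebook  # S: (B, M, N)
    pose_pred = decoder(soft_tokens)                 # \hat{G}
    rec = F.smooth_l1_loss(pose_pred, pose_gt)
    return cls + lambda_rec * rec
```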
4. Experiments
We first extensively evaluate the PCT representation on five benchmark datasets in the context of 2D human pose estimation. Then we present the 3D pose estimation results and compare them to the state-of-the-art methods. Ablation studies of the main components of our method are also provided to help understand the approach.
4.1. Datasets and metrics
2D pose datasets. First, we conduct experiments on the COCO [42] and MPII [1] datasets. The COCO dataset has 150K labeled human instances for training, 5K images for validation, and 30K images for testing. The MPII dataset has 40K labeled human instances performing a variety of activities. Second, we evaluate our method on four datasets that have severe occlusions: the test set of the CrowdPose [35] dataset, the validation and test sets of the OCHuman [107] dataset, and the SyncOCC [108] dataset. In CrowdPose [35] and OCHuman [107], the occluded joints are manually labeled by annotators. The SyncOCC [108] dataset is a synthetic dataset generated by UnrealCV [91], so it provides accurate locations of the occluded joints. We directly apply the model trained on the COCO dataset to the four datasets without re-training. We report results on the occluded joints to validate the capability of the model to handle occlusion.
3D pose datasets. We conduct experiments on the Human3.6M [30] dataset, which has 11 human subjects performing daily actions. We follow the practice of previous works such as [17]. In particular, five subjects (S1, S5, S6, S7, S8) are used for training, and two subjects (S9, S11) are used for testing. Since there are no labels for joint occlusion, we only compare our method to the state-of-the-art methods to validate the general applicability of the representation to both 2D and 3D poses.
Evaluation metrics. We follow the standard evaluation metrics for the COCO [42], MPII [1], and Human3.6M [30] datasets. In particular, the OKS-based AP (average precision), $\mathrm{AP}^{50}$, and $\mathrm{AP}^{75}$ are reported for the COCO dataset. The PCKh (head-normalized probability of correct keypoint) score is used for the MPII dataset. The MPJPE (mean per joint position error) is used for Human3.6M. On the four occlusion datasets, we report $\mathrm{AP}^{OC}$, based on OKS computed only on the occluded joints.
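For reference, a minimal sketch of OKS restricted to occluded joints (the quantity behind $\mathrm{AP}^{OC}$), assuming NumPy and the standard COCO per-joint constants; the function and variable names are ours:

```python
import numpy as np

def oks_occluded(pred, gt, occluded, area, kappas):
    """pred, gt: (K, 2) keypoints; occluded: (K,) bool mask from the dataset;
    area: object segment area s^2; kappas: (K,) per-joint COCO constants."""
    d2 = ((pred - gt) ** 2).sum(axis=-1)                    # squared distances
    sim = np.exp(-d2 / (2.0 * area * kappas ** 2 + 1e-12))  # per-joint similarity
    # Average only over the occluded joints; AP^OC thresholds this OKS.
    return sim[occluded].mean() if occluded.any() else float("nan")
```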
4.2. Implementation details
We adopt the top-down estimation pipeline. In training, we use the GT boxes provided by the datasets. In testing, we use the detection results provided by [92] for COCO, and the GT boxes for MPII and the occlusion datasets, following the common practice.
We use the Swin Transformer V2 [44, 45] backbone pre-trained with SimMIM [94] on ImageNet-1k [66]. It is also trained on the COCO dataset with heatmap supervision. To save computation cost, we fix the backbone and only train the classification head. We set the base learning rate, weight decay, and batch size to 8e-4, 0.05,