Zigang Geng$^{1,3}$, Chunyu Wang$^{3*}$, Yixuan Wei$^{2,3}$, Ze Liu$^{1,3}$, Houqiang Li$^{1}$, Han Hu$^{3*}$
$^{1}$University of Science and Technology of China $^{2}$Tsinghua University $^{3}$Microsoft Research Asia
https://sites.google.com/view/pctpose
Abstract
Human pose is typically represented by a coordinate vector of body joints or their heatmap embeddings. While convenient for data processing, these representations admit unrealistic pose estimates because they do not model the dependency between body joints. In this paper, we present a structured representation, named Pose as Compositional Tokens (PCT), to explore the joint dependency. It represents a pose by $M$ discrete tokens, each characterizing a sub-structure with several interdependent joints (see Figure 1). The compositional design enables it to achieve a small reconstruction error at a low cost. We then cast pose estimation as a classification task: we learn a classifier to predict the categories of the $M$ tokens from an image, and a pre-learned decoder network recovers the pose from the tokens without further post-processing. We show that it achieves better or comparable pose estimation results than existing methods in general scenarios, yet continues to work well when occlusion occurs, which is ubiquitous in practice. The code and models are publicly available at https://github.com/Gengzigang/PCT.
1. Introduction
Human pose estimation is a fundamental task in computer vision which aims to estimate the positions of body joints from images. Recent progress has focused on network structures [74, 87, 96], training methods [31, 68, 93], and fusion strategies [14, 15, 61, 67, 84, 102], which have notably advanced the accuracy on public datasets. However, it remains an open problem in challenging scenarios, e.g., in the presence of occlusion, which hinders its application in practice.
Figure 1. Our approach represents a pose by $M$ discrete tokens, which are indices to the codebook entries (top). Each token is learned to represent a sub-structure. In each row, we show that changing the state of one token to different values consistently changes the same sub-structure, highlighted in orange; the black poses are before the change (bottom).

Current 2D/3D pose estimators usually represent a pose by a coordinate vector [23, 34, 79, 110] or its heatmap embeddings [40, 55, 60, 74, 75, 80, 87, 90]. In both representations, the joints are treated independently, ignoring the fact that the body joints can serve as mutual context to each other. As a result, they may produce unrealistic estimates when occlusion occurs, as shown in Figure 2 (top). It is interesting to note, however, that humans can easily predict intact poses from only the visible joints and the visual features. This is probably because people are able to use context to aid recognition, as evidenced by psychology experiments [5, 58]. Some works attempt to introduce a tree or graph structure [2, 21, 65, 85] to model joint dependency. However, the hand-designed rules usually make unrealistic assumptions about the relationships, making them incapable of representing complex patterns.
Figure 2. Heatmap-based method (top) vs. our PCT method (bottom) in occluded scenes. PCT predicts reasonable poses even under severe occlusion. The images are from COCO val2017.

In this work, we hope to learn the dependency between the joints earlier, in the representation stage, without any assumptions. Our initial idea is to learn a set of prototype poses that are realistic and represent every pose by the nearest prototype. While this can guarantee that all poses are realistic, it requires a large number of prototypes to reduce the quantization error to a reasonable level, which is computationally infeasible. Instead, we propose a discrete representation, named pose as compositional tokens (PCT). Figure 3 shows the two stages of the representation. In Stage I, we learn a compositional encoder to transform a pose into $M$ token features, with each encoding a sub-structure of the pose; see Figure 1 for some examples. The tokens are then quantized by a shared codebook, so a pose is simply represented by $M$ discrete indices. The space represented by the codebook is sufficiently large to represent all poses accurately. We jointly learn the encoder, the codebook, and the decoder by minimizing a reconstruction error.
In Stage II, we cast human pose estimation as a classification task. Given an image, we predict the categories of the $M$ tokens, from which the pose is recovered by the decoder network. The PCT representation has several advantages. First, the dependency between the joints is modeled by the tokens, which helps to reduce the chance of getting unrealistic pose estimates. In particular, we see evidence that it has the potential to obtain reasonable estimates even when a large portion of the body is occluded; see Figure 2 (bottom) for some examples. Second, it does not require any expensive post-processing modules such as UDP [29], which the heatmap representation requires to reduce the quantization errors. Third, it provides a unified representation for 2D and 3D poses. In addition, the discrete representation potentially facilitates its interactions with other discrete modalities such as text and speech, although this is not the focus of this work.
We extensively evaluate our approach in 2D human pose estimation on five benchmark datasets. It achieves better or comparable accuracy to the state-of-the-art methods on all of them. More importantly, it achieves significantly better results when evaluated only on the occluded joints, validating the advantages of its dependency modeling capability. We also present 3D pose estimation results on the H36M dataset, on which it achieves comparable accuracy to the state-of-the-art methods using a simple architecture. The results demonstrate that it has wide applicability.
2. Related works
In this section, we first briefly discuss the widely used pose representations. Then we discuss the methods that explore joint dependencies.
2.1. Pose representations
Coordinates. Early works [4, 9, 50, 56, 76, 79, 110] propose to directly regress the coordinates of body joints from images. While efficient, their accuracy is worse than that of heatmap-based methods because the highly non-linear mapping is challenging to learn. Some works [23, 89] propose to improve them by focusing on local features around the joints. Residual Log-likelihood Estimation [34] proposes a novel regression paradigm to capture the underlying output distribution. MDN [82] introduces a mixture density network for regression. Recently, transformers [83] have brought notable improvements [36, 49, 71] due to their ability to capture long-range information.
Heatmaps. The heatmap representation [3, 8, 20, 38, 41, 48, 54, 62, 69, 88, 92, 98] has been dominant since its introduction [6, 78, 90] because of its strong localization and generalization ability. Many follow-up works have been devoted to continuously improving it, including proposing powerful networks [7, 12, 13, 27, 55, 74] to estimate the heatmaps more accurately, introducing the attention operator into the models [40, 72, 97, 103], reducing the quantization errors [29, 105], fusing with the coordinate prediction-based methods [19, 25, 60, 75], refining the results [22, 53, 73, 85], leveraging other tasks [33, 57, 59], and leveraging large unlabeled datasets [32, 93]. However, the heatmap representation suffers from quantization errors caused by the downsampling operations in neural networks. Besides, joint dependency is not modeled by the heatmaps.
Discrete bins. Recent works [10, 39, 47] propose to divide each pixel into several bins, allowing sub-pixel localization accuracy. The horizontal and vertical coordinates of each joint are separately quantized into discrete classes. Similar to our work, they also cast human pose estimation as a classification task. However, each coordinate of the pose is treated independently, which differs from our structured representation.
2.2. Modeling joint dependency
Since the human body has an articulated structure, many works try to model joint dependency to help resolve low-level ambiguities. However, most of them focus on the modeling aspect rather than the representation, which is the focus of this work.
Pictorial structures. Some works [2, 21, 63, 65, 100] propose to use deformable models where the relationship between body joints is explicitly considered based on anatomical priors (e.g., limb lengths). However, they have three disadvantages. First, they usually make strong assumptions about the relationships, e.g., a Gaussian distribution on the offsets between two joints, making them incapable of representing complex patterns. Second, they still require the body joints to be independently detected from images first, after which they use the dependency priors to obtain the most plausible configuration. However, the first step is already very difficult in cluttered scenes with serious occlusions. Finally, they cannot be trained end-to-end with deep networks, with the exception of [78], which needs to relax the formulation.
Implicit modeling. Recent deep learning-based methods [16, 64, 85, 99, 101, 106] implicitly model the dependency by propagating visual features between the joints. For example, Chu et al. [16] introduce geometrical transform kernels to fuse the features of different channels, which are believed to characterize different joints. Wang et al. [85] use a Graph Convolutional Network to refine pose estimates that are first obtained by heatmap-based methods. In addition, Chen et al. [11] propose to learn a pose discriminator to exclude non-realistic pose estimates and push the predictor to learn poses with reasonable structures. Li et al. [40] explicitly learn a type embedding for each joint and apply a transformer to model the relationships among the joints. But from the aspect of representation, they still treat each joint independently and predict a heatmap for each joint.
Our PCT representation differs from the previous methods in three aspects. First, the joint dependency is encoded earlier, in the representation itself, by the tokens: changing the state of a token changes the corresponding sub-structure rather than a single joint. In contrast, the other three representations treat each joint independently. Second, the sub-structures are automatically learned from training data without making any unrealistic assumptions. We empirically show that it has a stronger capability to resolve ambiguities caused by occlusion in a variety of situations. Third, the joint dependency is explicitly imposed rather than left to implicit feature propagation; the latter still allows unrealistic pose estimates in challenging cases.
3. Pose as Compositional Tokens
In Section 3.1, we describe how to learn the codebook and the encoder/decoder networks. Section 3.2 explains how it is used in the human pose estimation task.
3.1. Learning compositional tokens
We represent a raw pose as $\mathbf{G} \in \mathbb{R}^{K \times D}$, where $K$ is the number of body joints and $D$ is the dimension of each joint, with $D=2$ for 2D poses and $D=3$ for 3D poses, respectively. We learn a compositional encoder $f_{e}(\cdot)$ to transform a pose into $M$ token features:

$$\left(\mathbf{t}_{1}, \mathbf{t}_{2}, \cdots, \mathbf{t}_{M}\right)=f_{e}(\mathbf{G}),$$
where each token feature $\mathbf{t}_{i} \in \mathbb{R}^{H}$ approximately corresponds to a sub-structure of the pose which involves a few interdependent joints. Figure 1 shows some of the learned examples. Note that the representation has lots of redundancy because different tokens may have overlapping joints. The redundancy makes it robust to occlusions of individual parts.
Similar to [81], we define a latent embedding space by a codebook $\mathbf{C}=\left(\mathbf{c}_{1}, \cdots, \mathbf{c}_{V}\right)^{\mathrm{T}} \in \mathbb{R}^{V \times N}$, where $V$ is the number of codebook entries. We quantize each token $\mathbf{t}_{i}$ by nearest-neighbor look-up in the embedding space:

$$q\left(\mathbf{t}_{i}\right)=\underset{j}{\arg \min }\left\|\mathbf{t}_{i}-\mathbf{c}_{j}\right\|_{2}.$$
Note that all tokens share the same embedding space $\mathbf{C}$, which simplifies training.
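To make the lookup concrete, here is a minimal sketch of the shared-codebook quantization, assuming PyTorch; the tensor names and shapes are ours, not the released implementation:

```python
import torch

def quantize(tokens: torch.Tensor, codebook: torch.Tensor):
    """Nearest-neighbor quantization of token features.

    tokens:   (M, N) token features produced by the encoder.
    codebook: (V, N) shared codebook entries C.
    Returns the discrete indices q(t_i) and the quantized features c_{q(t_i)}.
    """
    dists = torch.cdist(tokens, codebook)  # (M, V) pairwise L2 distances
    indices = dists.argmin(dim=-1)         # (M,)  nearest entry per token
    return indices, codebook[indices]      # (M,), (M, N)
```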
We abuse $q\left(\mathbf{t}_{i}\right)$ to also represent the index of the corresponding codebook entry. The quantized tokens $\left(\mathbf{c}_{q\left(\mathbf{t}_{1}\right)}, \mathbf{c}_{q\left(\mathbf{t}_{2}\right)}, \cdots, \mathbf{c}_{q\left(\mathbf{t}_{M}\right)}\right)$ are then fed to the decoder network $f_{d}(\cdot)$ to recover the original pose:

$$\hat{\mathbf{G}}=f_{d}\left(\mathbf{c}_{q\left(\mathbf{t}_{1}\right)}, \mathbf{c}_{q\left(\mathbf{t}_{2}\right)}, \cdots, \mathbf{c}_{q\left(\mathbf{t}_{M}\right)}\right).$$
The network structure is similar to the encoder network in the reverse order, except that we use a shallower MLP-Mixer network with only one block.
The encoder network, the codebook, and the decoder network are jointly learned by minimizing the following loss over the training dataset:
$$\ell_{pct}=\operatorname{smooth}_{L_{1}}(\hat{\mathbf{G}}, \mathbf{G})+\beta \sum_{i=1}^{M}\left\|\mathbf{t}_{i}-\operatorname{sg}\left[\mathbf{c}_{q\left(\mathbf{t}_{i}\right)}\right]\right\|_{2}^{2},$$
where sg denotes the stop-gradient operation and $\beta$ is a hyperparameter.
We follow the optimization strategy used in [81] to handle the broken-gradients issue in the quantization step, and the codebook is updated using the exponential moving average of previous token features. In our implementation, two designs improve the results. First, inspired by [26, 94], we randomly mask some joints and require the model to reconstruct them. Second, we concatenate the image features around the joints with the positional features to enhance the discrimination ability.
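A hedged sketch of the Stage-I objective $\ell_{pct}$ with the straight-through trick that [81] uses for the broken gradients (our simplification, reusing `quantize` from the sketch above; the EMA codebook update happens outside this loss, the joint masking is omitted, and the `beta` value is a placeholder):

```python
import torch
import torch.nn.functional as F

def pct_loss(pose_gt, tokens, quantized, decoder, beta=0.25):
    # Straight-through estimator: the forward pass uses the quantized
    # entries; the backward pass routes gradients through the raw tokens,
    # bypassing the non-differentiable argmin.
    st_tokens = tokens + (quantized - tokens).detach()
    pose_rec = decoder(st_tokens)                    # \hat{G}
    rec = F.smooth_l1_loss(pose_rec, pose_gt)        # smooth-L1 term
    # Commitment term: ||t_i - sg[c_{q(t_i)}]||_2^2 summed over the M tokens.
    commit = ((tokens - quantized.detach()) ** 2).sum(dim=-1).sum()
    return rec + beta * commit
```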
Discussion. We try to explain why PCT learns tokens that correspond to meaningful sub-structures of poses. At one extreme, if each token corresponds to a single joint, then we need $w \times h$ (i.e., 65536 for an image of size $256 \times 256$) codebook entries to achieve a small quantization error. But we use only 1024 entries in our experiments, which is much smaller. This drives the model to learn larger structures than individual joints to improve the efficiency of the codebook. At the other extreme, if we let a token correspond to an intact pose, then we need only one token instead of $M$ tokens. But in the worst case, it requires $(wh)^{K}$ codebook entries to quantize the poses with a small error. In contrast, our method drives the model to divide a pose into multiple basic sub-structures whose possible configurations can be described by a shared set.
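To make the capacity argument concrete with the numbers quoted above (our restatement of the same arithmetic):

$$wh = 256 \times 256 = 65536 \quad \text{entries needed for per-joint tokens},$$
$$(wh)^{K} = 65536^{K} \quad \text{entries, in the worst case, for a single whole-pose token},$$
$$V^{M} \quad \text{composite configurations from only } V = 1024 \text{ shared entries with } M \text{ tokens}.$$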
Relation to VQ-VAE [81]. The PCT representation is inspired by VQ-VAE. The main difference is that VQ-VAE treats well-defined regular data, e.g., image patches with a resolution of $16 \times 16$, as tokens. For human poses, in contrast, we require PCT to automatically learn meaningful sub-structures as tokens, which is realized by the compositional encoder as well as the codebook sharing scheme. Besides, the network structures of the encoder and decoder are particularly designed for human poses, differing from VQ-VAE.
3.2. Human Pose Estimation
With the learned codebook and decoder, we cast human pose estimation as a classification task. As shown in Figure 3, given a cropped input image $\mathbf{I}$, we simply predict the categories of the $M$ tokens, which are fed to the decoder to recover the pose. We use a backbone to extract image features $\mathbf{X}$ and design the following classification head.
Classification head. We first use two basic residual convolution blocks [28] to modulate the backbone features. Then, we flatten the features and change their dimension by a linear projection layer:

$$\mathbf{x}_{f}=\mathcal{L}(\operatorname{flatten}(\mathcal{C}(\mathbf{X}))),$$
where $\mathcal{C}$ and $\mathcal{L}$ represent the feature modulator and the linear projection, respectively. We reshape the one-dimensional output feature $\mathbf{x}_{f}$ into a matrix $\mathbf{X}_{f} \in \mathbb{R}^{M \times N}$, use four MLP-Mixer blocks [77] to process the features, and output the logits of token classification $\hat{\mathbf{L}} \in \mathbb{R}^{M \times V}$.
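The head can be sketched as follows, assuming PyTorch; the channel widths, $M$, $N$, $V$, the feature map size, and the simplified mixer block are placeholders rather than the released architecture:

```python
import torch
import torch.nn as nn

class MixerBlock(nn.Module):
    """Simplified MLP-Mixer block [77]: token mixing, then channel mixing."""
    def __init__(self, num_tokens: int, dim: int):
        super().__init__()
        self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)
        self.token_mlp = nn.Sequential(
            nn.Linear(num_tokens, num_tokens), nn.GELU(),
            nn.Linear(num_tokens, num_tokens))
        self.chan_mlp = nn.Sequential(
            nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))

    def forward(self, x):  # x: (B, M, N)
        x = x + self.token_mlp(self.norm1(x).transpose(1, 2)).transpose(1, 2)
        return x + self.chan_mlp(self.norm2(x))

class ClassificationHead(nn.Module):
    def __init__(self, in_ch=768, feat_hw=64, M=34, N=256, V=1024):
        super().__init__()
        self.M, self.N = M, N
        # Feature modulator C: a stand-in for the two residual conv blocks.
        self.modulator = nn.Sequential(
            nn.Conv2d(in_ch, in_ch, 3, padding=1), nn.ReLU(),
            nn.Conv2d(in_ch, in_ch, 3, padding=1), nn.ReLU())
        # Linear projection L from the flattened features to M*N dimensions.
        self.proj = nn.Linear(in_ch * feat_hw, M * N)
        self.mixers = nn.Sequential(*[MixerBlock(M, N) for _ in range(4)])
        self.to_logits = nn.Linear(N, V)  # per-token logits over V classes

    def forward(self, feats):  # feats: (B, in_ch, H', W') backbone features X
        x = self.modulator(feats) + feats      # residual modulation
        x = self.proj(x.flatten(1))            # (B, M*N)
        x = x.view(-1, self.M, self.N)         # reshape into X_f
        return self.to_logits(self.mixers(x))  # \hat{L}: (B, M, V)
```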
Training. We use two losses to train the classification head. First, we enforce the cross-entropy loss:
$$\ell_{cls}=\operatorname{CE}(\hat{\mathbf{L}}, \mathbf{L}),$$
where $\mathbf{L}$ denotes the ground-truth token classes obtained by feeding the ground-truth poses into the encoder.
We also enforce a pose reconstruction loss, which minimizes the difference between the predicted and the ground-truth poses. To allow the gradients from the decoder network to flow back to the classification head, we replace the hard inference scheme with a soft version:

$$\mathbf{S}=\operatorname{softmax}(\hat{\mathbf{L}})\, \mathbf{C},$$
where $\mathbf{S} \in \mathbb{R}^{M \times N}$ denotes the linearly interpolated token features. The token features $\mathbf{S}$ are then fed to the pre-learned decoder to obtain the predicted pose $\hat{\mathbf{G}}$. The complete loss function is:

$$\ell=\ell_{cls}+\lambda\, \operatorname{smooth}_{L_{1}}(\hat{\mathbf{G}}, \mathbf{G}),$$

where $\lambda$ is a weighting hyperparameter.
Note that the decoder network is not updated during training.
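Putting the Stage-II objective together, a minimal sketch, assuming PyTorch; `lambda_rec` is a placeholder weight, and the decoder's parameters are assumed frozen via `requires_grad_(False)` so gradients flow through it to the head without updating it:

```python
import torch
import torch.nn.functional as F

def stage2_loss(logits, gt_classes, pose_gt, codebook, decoder, lambda_rec=1.0):
    """logits: (B, M, V) token logits \\hat{L}; gt_classes: (B, M) classes L;
    codebook: (V, N) frozen entries C; decoder: pre-learned and frozen."""
    cls = F.cross_entropy(logits.flatten(0, 1), gt_classes.flatten())
    # Soft inference: interpolate codebook entries with the predicted class
    # probabilities so the reconstruction term is differentiable w.r.t. logits.
    soft_tokens = logits.softmax(dim=-1) @ codebook  # S: (B, M, N)
    pose_pred = decoder(soft_tokens)                 # \hat{G}
    rec = F.smooth_l1_loss(pose_pred, pose_gt)
    return cls + lambda_rec * rec
```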
4. Experiments
We first extensively evaluate the PCT representation on five benchmark datasets in the context of 2D human pose estimation. Then we present the 3D pose estimation results and compare them to the state-of-the-art methods. Ablation studies of the main components of our method are also provided to help understand the approach.
4.1. Datasets and metrics
2D pose datasets. First, we conduct experiments on the COCO [42] and MPII [1] datasets. The COCO dataset has 150K labeled human instances for training, 5K images for validation, and 30K images for testing. The MPII dataset has 40K labeled human instances performing a variety of activities. Second, we evaluate our method on four datasets that have severe occlusions: the test set of the CrowdPose [35] dataset, the validation and test sets of the OCHuman [107] dataset, and the SyncOCC [108] dataset. In CrowdPose [35] and OCHuman [107], the occluded joints are manually labeled by annotators. The SyncOCC [108] dataset is a synthetic dataset generated by UnrealCV [91], so it provides accurate locations of the occluded joints. We directly apply the model trained on the COCO dataset to the four datasets without re-training. We report results on the occluded joints to validate the capability of the model to handle occlusion.
3D pose datasets. We conduct experiments on the Human3.6M [30] dataset, which has 11 human subjects performing daily actions. We follow the practice of previous works such as [17]. In particular, five subjects (S1, S5, S6, S7, S8) are used for training, and two subjects (S9, S11) are used for testing. Since there are no labels for joint occlusion, we only compare our method to the state-of-the-art methods to validate the general applicability of the representation to both 2D and 3D poses.
Evaluation metrics. We follow the standard evaluation metrics for the COCO [42], MPII [1], and Human3.6M [30] datasets. In particular, the OKS-based AP (average precision), $\mathrm{AP}^{50}$, and $\mathrm{AP}^{75}$ are reported for the COCO dataset. The PCKh (head-normalized probability of correct keypoint) score is used for the MPII dataset. The MPJPE (mean per joint position error) is used for Human3.6M. On the four occlusion datasets, we report $\mathrm{AP}^{OC}$, based on OKS computed only on the occluded joints.
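For reference, a minimal sketch of OKS restricted to occluded joints (the quantity behind $\mathrm{AP}^{OC}$), assuming NumPy and the standard COCO per-joint constants; the function and variable names are ours:

```python
import numpy as np

def oks_occluded(pred, gt, occluded, area, kappas):
    """pred, gt: (K, 2) keypoints; occluded: (K,) bool mask from the dataset;
    area: object segment area s^2; kappas: (K,) per-joint COCO constants."""
    d2 = ((pred - gt) ** 2).sum(axis=-1)                    # squared distances
    sim = np.exp(-d2 / (2.0 * area * kappas ** 2 + 1e-12))  # per-joint similarity
    # Average only over the occluded joints; AP^OC thresholds this OKS.
    return sim[occluded].mean() if occluded.any() else float("nan")
```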
4.2. Implementation details
We adopt the top-down estimation pipeline. In training, we use the GT boxes provided by the datasets. In testing, we use the detection results provided by [92] for COCO, and the GT boxes for MPII and the occlusion datasets, following the common practice.
We use the Swin Transformer V2 [44, 45] backbone pre-trained with SimMIM [94] on ImageNet-1k [66]. It is also trained on the COCO dataset with heatmap supervision. To save computation cost, we fix the backbone and only train the classification head. We set the base learning rate, weight decay, and batch size to 8e-4, 0.05,