
One-Stage 3D Whole-Body Mesh Recovery with Component Aware Transformer

Jing Lin1,2§, Ailing Zeng, Haoqian Wang2, Lei Zhang1, Yu Li1
1 International Digital Economy Academy (IDEA),

2 Shenzhen International Graduate School, Tsinghua University

https://osx-ubody.github.io
Abstract

Whole-body mesh recovery aims to estimate the 3D human body, face, and hands parameters from a single image. It is challenging to perform this task with a single network due to resolution issues, i.e., the face and hands are usually located in extremely small regions. Existing works usually detect hands and faces, enlarge their resolution to feed into a specific network to predict the parameters, and finally fuse the results. While this copy-paste pipeline can capture the fine-grained details of the face and hands, the connections between different parts cannot be easily recovered in late fusion, leading to implausible 3D rotations and unnatural poses. In this work, we propose a one-stage pipeline for expressive whole-body mesh recovery, named OSX, without separate networks for each part. Specifically, we design a Component Aware Transformer (CAT) composed of a global body encoder and a local face/hand decoder. The encoder predicts the body parameters and provides a high-quality feature map for the decoder, which performs a feature-level upsample-crop scheme to extract high-resolution part-specific features and adopts keypoint-guided deformable attention to estimate the hands and face precisely. The whole pipeline is simple yet effective without any manual post-processing and naturally avoids implausible predictions. Comprehensive experiments demonstrate the effectiveness of OSX. Lastly, we build a large-scale Upper-Body dataset (UBody) with high-quality 2D and 3D whole-body annotations. It contains persons with partially visible bodies in diverse real-life scenarios to bridge the gap between the basic task and downstream applications. § Work done during an internship at IDEA; ¶ Corresponding author.

Figure 1: A comparison of existing whole-body mesh recovery methods and ours. Most existing methods leverage a multi-stage pipeline, which uses separate expert models to process each body component (e.g., E1: HeadNet, E2: HandNet, E3: BodyNet) and fuses them to get the whole-body prediction in a copy-paste manner. The result (from [48]) produces unnatural wrist poses. In contrast, our pipeline is a neat one-stage framework with a single encoder-decoder and can predict more accurate and natural meshes.

1 Introduction

Expressive whole-body mesh recovery aims to jointly estimate the 3D human body poses, hand gestures, and facial expressions from monocular images. It is gaining increasing attention due to recent advancements in whole-body parametric models (e.g., SMPL-X [49]). This task is a key step in modeling human behaviors and has many applications, e.g., motion capture and human-computer interaction. Previous research focuses on the individual tasks of reconstructing the human body [28, 32, 67, 68, 11, 61], face [2, 60, 12, 15], or hands [4, 22, 10]. However, whole-body mesh recovery is particularly challenging as it requires accurate estimation of each part and natural connections between them.

Existing learning-based works [48, 16, 56, 39, 70] use multi-stage pipelines for body, hand, and face estimation to achieve the goal of this task. As depicted in Figure 1(a), these methods typically detect different body parts, crop and resize each region, and feed them into separate expert models to estimate the parameters of each part. The multi-stage pipeline with different estimators for the body, hands, and face results in a complicated system with high computational complexity. Moreover, the lack of communication among different components inevitably causes incompatible configurations, unnatural articulation of the mesh, and implausible 3D wrist rotations, as each component cannot obtain informative and consistent clues from the others. Some methods [16, 39, 70] attempt to alleviate these issues by designing additional complicated integration schemes or elbow-twist compensation fusion among individual body parts. However, these approaches can be regarded as a late fusion strategy and thus have limited ability to let the parts enhance each other and correct implausible predictions.

In this work, we propose a one-stage framework named OSX for 3D whole-body mesh recovery, as shown in Figure 1(b), which does not require separate networks for each part. Inspired by recent advancements in Vision Transformers [13, 64], which are effective in capturing spatial information in a plain architecture, we design our pipeline as a component-aware Transformer (CAT) composed of a global body encoder and a local component-specific decoder. The encoder, equipped with body tokens as inputs, captures the global correlation, predicts the body parameters, and simultaneously provides a high-quality feature map for the decoder. The decoder utilizes a differentiable upsample-crop scheme to extract part-specific high-resolution features and adopts keypoint-guided deformable attention to precisely locate and estimate the hand and face parameters. The proposed pipeline is simple yet effective without any manual post-processing. To the best of our knowledge, this is the first one-stage pipeline for 3D whole-body estimation. We conduct comprehensive experiments to investigate the effects of the above designs and compare our method with existing works on three benchmarks. Results show that OSX outperforms the state of the art (SOTA) [39] by 9.5% on AGORA, 7.8% on EHF, and 13.4% on the body-only 3DPW dataset.

In addition, existing popular benchmarks, as illustrated in the first row of Figure 2, are either indoor single-person scenes with limited images (e.g., EHF [49]) or outdoor synthetic scenes (e.g., AGORA [47]), where the people are often too far from the camera and the hands and faces are frequently obscured. In fact, human pose estimation and mesh recovery are fundamental tasks that benefit many downstream applications, such as sign language recognition, gesture generation, and human-computer interaction. Many scenarios, such as talk shows and online classes, are of vital importance to our daily life yet under-explored. In such scenarios, the upper body is the major focus, while the hands and face are essential for analysis. To address this issue, we build a large-scale upper-body dataset with fifteen human-centric real-life scenes, as shown in Figure 2(f) to (t). This dataset contains many unseen poses, diverse appearances, heavy truncation, interactions, and abrupt shot changes, which are quite different from previous datasets. Accordingly, we design a systematic annotation pipeline and provide precise 2D whole-body keypoint and 3D whole-body mesh annotations. With this dataset, we perform a comprehensive benchmarking of existing whole-body estimators.

Our contributions can be summarized as follows.

  • We propose a one-stage pipeline, OSX, for 3D whole-body mesh recovery, which can regress the SMPL-X parameters in a simple yet effective manner.


  • Despite the conceptual simplicity of our one-stage framework, it achieves the new state of the art on three popular benchmarks.


  • We build a large-scale upper-body dataset, UBody, to bridge the gap between the basic task and downstream applications and provide precise annotations, with which we conduct benchmarking of existing methods. We hope UBody can inspire new research topics.


Figure 2: Illustration of five previous datasets (from (a) to (e)) and the proposed Upper Body Dataset (from (f) to (t)) with fifteen real-life scenes. UBody bridges the gap between the basic 3D whole-body estimation task and downstream tasks with highly expressive actions.

2 Related Work

2.1 Methods of Whole-body Mesh Recovery

Whole-body mesh recovery aims to localize the mesh vertices of all human components, including the body, hands, and face, from monocular images. Most previous works focus only on individual hand [4, 22, 10], face [2, 60, 12, 15], or body [32, 67, 61, 30, 31] reconstruction. In contrast, joint whole-body estimation is less addressed. Some optimization-based works reconstruct 3D bodies by fitting the detected 2D keypoints from images with additional constraints, but they are slow and prone to local optima [49, 63]. Thanks to the whole-body parametric model (e.g., SMPL-X [49]), learning-based models [48, 56, 74, 16, 59] have emerged that train networks to predict expressive body pose, shape, hand gesture, and facial expression. Due to the low resolution of the hands and face, these whole-body methods crop and resize the hand and face images to higher resolutions and feed them into separate expert networks to conduct the corresponding parameter regression. Specifically, ExPose [48] introduces body-driven attention for higher-resolution crops for face and hand estimation, a dedicated refinement module, and part-specific knowledge from existing hand-only and face-only datasets. FrankMocap [56] presents a regression-and-integration method to build a fast and accurate system. PIXIE [16] produces an animatable whole body with realistic facial details via a moderator that fuses body part features adaptively. Recently, Hand4Whole [39] utilizes both body and hand joint features for accurate 3D wrist rotation and smooth connection between the body and hands.

Nevertheless, these methods aim at high performance by using separate networks in a divide-and-conquer fashion for different components and a specific fusion module to paste them together. The multi-stage pipelines lead to high complexity and inevitably cause inconsistent and unnatural articulation of the mesh and implausible 3D wrist rotations, especially in occluded, truncated, and blurry contexts. Until now, one-stage methods for this task have been unexplored.

2.2 Benchmarks of Expressive Body

Some datasets with parametric model annotations [47, 49, 62, 6, 23, 40, 26] have been developed to advance the field. Table 1 summarizes these datasets in terms of annotation type, size, scene diversity, etc. To be specific, EHF [49] is the first evaluation dataset for SMPL-X-based models, which is built by capturing 3D body shapes with a scanning system and then fitting the SMPL-X model to the scans. AGORA [47] is a synthetic dataset with high realism and accurate ground truth, which is by far the most commonly used test data due to its diversity of subjects, environments, clothes, and occlusions. Notably, people in AGORA are often far from the camera, and their hands and faces are obscured and have small resolutions, making existing methods focus more on body rather than hand and face estimation.

Since marker-based 3D mocap labels are hard to obtain, a few annotation methods [40, 49, 16, 43, 55, 50] have been proposed for high-precision labeling of both monocular indoor and outdoor scenes. FBA [55] emphasizes the severe failure cases of existing body recovery methods on consumer video data due to unusual camera viewpoints and aggressive truncations. They annotate pseudo 2D body keypoints and SMPL annotations via HMR [28] on 13k frames across four action recognition datasets. Multi-shot-AVA [50] also argues that data from edited media, like movies with rich appearances, interactions between humans, and various temporal contexts, is valuable. They apply the proposed multi-shot optimization on AVA [18] to get pseudo 3D ground truth. Interestingly, a body recovery benchmark [46] finds that simply using the 2D COCO dataset with pseudo-3D labels can surprisingly achieve better performance and generalization ability. To complement these prior datasets and focus on expressive body recovery, we construct a new benchmark with high-quality 2D and 3D whole-body annotations.

| Type | Dataset | #Frames | Scenes | Multi-Person | In-the-wild | Upper-Body | Video | Annotation Type | Annotation Source |
| Rendered | AGORA [47] | 17K | Daily | Y | N | N | N | SMPL-X | [47] |
| Marker/Sensor-based MoCap | Human3.6M [23] | 3.6M | Daily | N | N | N | Y | SMPL-X | [40] |
| Marker/Sensor-based MoCap | 3DPW [62] | >51K | Daily | Y | Y | N | Y | SMPL-X | [40] |
| Marker-less Multi-view MoCap | MPI-INF-3DHP [37] | >1.3M | Daily | N | Y | N | Y | SMPL-X | [40] |
| Marker-less Multi-view MoCap | EHF [49] | 0.1K | Daily | N | N | N | N | SMPL-X | [49] |
| Marker-less Multi-view MoCap | ZJU-MoCap [51] | ≥237K | Daily | N | N | N | Y | SMPL-X | [1] |
| Pseudo-3D Labels | PennAction [72] | 77K | Fitness | N | Y | N | Y | SMPL | [71] |
| Pseudo-3D Labels | MSCOCO [35] | 200K | Daily | Y | Y | N | N | SMPL-X | [40] |
| Pseudo-3D Labels | COCO-Wholebody [24] | 200K | Daily | Y | Y | N | N | 2D KPT | [24] |
| Pseudo-3D Labels | MPII [3] | 25K | Daily | Y | Y | N | N | SMPL-X | [40] |
| Pseudo-3D Labels | MTP [42] | 3.8K | Daily | N | Y | N | N | SMPL-X | [42] |
| Pseudo-3D Labels | FBA [55] | 13K | Vlog & Cook & Daily | Y | Y | N | Y | SMPL | [55] |
| Pseudo-3D Labels | Multi-shot-AVA [50] | 350K | Movie | Y | Y | N | Y | SMPL | [50] |
| | UBody (Ours) | >1051K | Real-life Scenes | Y | Y | Y | Y | SMPL-X & 2D KPT | Ours |
Table 1: Comparison of related datasets. UBody is a large-scale upper-body dataset with high-precision whole-body annotations.

3 Method

| Method | AGORA-val Hand | AGORA-val Face | AGORA-val All | EHF Hand | EHF Face | EHF All |
| Ori. | 73.3 | 81.4 | 183.8 | 42.7 | 25.7 | 77.5 |
| Ori.+1/4 Hand | 75.7 | 80.8 | 183.0 | 50.9 | 24.8 | 78.5 |
| Ori.+1/4 Face | 73.2 | 81.8 | 184.0 | 41.3 | 24.2 | 77.0 |
| Share Backbone | 81.1 | 91.0 | 202.3 | 55.5 | 33.5 | 84.7 |
| Share+1/4 Hand | 77.4 | 86.1 | 188.6 | 57.0 | 25.8 | 84.8 |
| Share+1/4 Face | 79.5 | 85.0 | 196.7 | 57.8 | 24.4 | 82.5 |
Table 2: A preliminary study on the effect of different component input scales and of a shared backbone for all components' feature extraction.

3.1 Motivation

A one-stage framework is vital to simplify the cumbersome processes without hand-crafted and complex integration designs. However, translating a multi-stage method directly into a one-stage method is nontrivial. We take the present state-of-the-art method Hand4Whole [39] as an example to perform some preliminary studies on bridging the gap between the multi-stage and one-stage approaches. On the one hand, we replace its separate backbones with a shared backbone for all human components. On the other hand, we explore different crop-and-resize image resolutions for the hands and face, as they usually have small image resolutions.

Table 2 shows that, when we transition from the original setup (Ori.) to a shared backbone (Share Backbone), all recovery errors deteriorate severely on the two datasets. Specifically, MPVPE increases from 183.8mm to 202.3mm (a 10.1% performance drop) on AGORA [47], and from 77.5mm to 84.7mm (a 9.3% drop) on EHF for all components (All). These results indicate that extracting the multi-component whole-body features with a shared backbone is difficult. Notably, the hand estimation performance deteriorates by 30.0% on EHF. Based on the results of different resolutions, we summarize some interesting observations as follows: (i) Overall, changing the resolution of the hands results in a larger performance drop than the face on EHF; (ii) When not sharing a backbone, the results are generally worse with smaller input resolutions of the hands and face.

Figure 3: The overview of the proposed one-stage framework (OSX) with component-aware transformer. It includes (a) a component-aware Transformer encoder and (b) a component-aware Transformer decoder.

3.2 Building Component Aware Transformer

As an attempt to break the above status quo, we propose a one-stage framework with a vision transformer encoder and decoder for expressive full-body mesh recovery, named OSX. It is simple in design and effective in full-body mesh prediction, as we will demonstrate later. We hope it can serve as a baseline for future one-stage methods. Given a human image $\mathbf{I}\in\mathbb{R}^{H\times W\times 3}$, our component-aware Transformer (CAT) estimates the corresponding body, hand, and face parameters $\hat{\mathcal{P}}=\{\hat{\mathbf{P}}_{body},\hat{\mathbf{P}}_{lhand},\hat{\mathbf{P}}_{rhand},\hat{\mathbf{P}}_{face}\}$ and then feeds them into a SMPL-X layer [49] to obtain the final 3D whole-body human mesh. Specifically, $\hat{\mathbf{P}}_{body}$ contains the 3D body joint rotations $\theta_{body}\in\mathbb{R}^{22\times 3}$, body shape $\beta\in\mathbb{R}^{10}$, and 3D global translation $t\in\mathbb{R}^{3}$. $\hat{\mathbf{P}}_{lhand}$ and $\hat{\mathbf{P}}_{rhand}$ contain the 3D left- and right-hand joint rotations $\theta_{lhand}\in\mathbb{R}^{15\times 3}$ and $\theta_{rhand}\in\mathbb{R}^{15\times 3}$, respectively. $\hat{\mathbf{P}}_{face}$ consists of the 3D jaw rotation $\theta_{face}\in\mathbb{R}^{3}$ and facial expression $\phi\in\mathbb{R}^{10}$. Our training target is to minimize the distance between the recovered parameters $\hat{\mathcal{P}}$ and the ground-truth parameters $\mathcal{P}$. As shown in Figure 3, the proposed CAT consists of a component-aware encoder to capture the global correlation and extract high-quality multi-scale features, and a component-aware decoder to strengthen the hand and face regression via an up-sampling strategy that obtains higher-resolution feature maps.
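For concreteness, the following is a minimal sketch (not the exact OSX implementation) of how parameters with these shapes can be converted into a whole-body mesh with the public smplx Python package; the model directory, batch size, and zero-valued parameters are placeholder assumptions.

```python
# Sketch: feeding predicted SMPL-X parameters into an SMPL-X layer to obtain the mesh.
# `SMPLX_MODEL_DIR` is a placeholder path to the downloaded SMPL-X model files.
import torch
import smplx

SMPLX_MODEL_DIR = "./models"   # assumption: local folder with SMPL-X model files
batch = 1
layer = smplx.create(SMPLX_MODEL_DIR, model_type="smplx",
                     use_pca=False, batch_size=batch)

# Predicted parameters with the shapes given above (axis-angle rotations).
theta_body  = torch.zeros(batch, 22, 3)   # pelvis (global) + 21 body joint rotations
beta        = torch.zeros(batch, 10)      # body shape
t           = torch.zeros(batch, 3)       # global translation
theta_lhand = torch.zeros(batch, 15, 3)   # left-hand joint rotations
theta_rhand = torch.zeros(batch, 15, 3)   # right-hand joint rotations
theta_face  = torch.zeros(batch, 3)       # jaw rotation
phi         = torch.zeros(batch, 10)      # facial expression

out = layer(global_orient=theta_body[:, 0],
            body_pose=theta_body[:, 1:].flatten(1),
            betas=beta, transl=t,
            left_hand_pose=theta_lhand.flatten(1),
            right_hand_pose=theta_rhand.flatten(1),
            jaw_pose=theta_face, expression=phi)
print(out.vertices.shape)   # (1, 10475, 3): 3D whole-body mesh vertices
```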

3.3 Body Regression via Global Encoder

In the component-aware encoder, the human image $\mathbf{I}$ is split into fixed-size image patches $\mathbf{P}\in\mathbb{R}^{\frac{HW}{M^{2}}\times(M^{2}\times 3)}$, where $M$ is the patch size. The patches $\mathbf{P}$ are then linearly projected by a convolution layer and added with position embeddings $\mathbf{P_{e}}\in\mathbb{R}^{\frac{HW}{M^{2}}\times C}$ to obtain a sequence of feature tokens $\mathbf{T_{f}}\in\mathbb{R}^{\frac{HW}{M^{2}}\times C}$. To explicitly leverage the body prior and learn the body information in the encoder, we concatenate the feature tokens $\mathbf{T_{f}}$ with the body tokens $\mathbf{T_{b}}\in\mathbb{R}^{B\times C}$, which are learnable parameters. The concatenated tokens are then fed into a standard Transformer encoder with multiple Transformer blocks [13]. Each block consists of a multi-head self-attention layer, a feed-forward network (FFN), and two layer normalizations. After the global feature fusion, the body tokens and image feature tokens are updated into $\mathbf{T_{b}}^{\prime}\in\mathbb{R}^{B\times C}$ and $\mathbf{T_{f}}^{\prime}\in\mathbb{R}^{\frac{HW}{M^{2}}\times C}$. Finally, we use several fully connected layers to regress the body parameters $\hat{\mathbf{P}}_{body}=\{\theta_{body},\beta,t\}$ based on $\mathbf{T_{b}}^{\prime}$.
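A minimal PyTorch sketch of this data flow is given below. The number of body tokens ($B=27$) follows Section 5.1, while the image size, patch size, embedding dimension, and depth are illustrative assumptions rather than the exact OSX encoder configuration (a pretrained ViT in practice).

```python
# Sketch of the global body encoder: learnable body tokens are concatenated with
# ViT patch tokens, fused by a Transformer encoder, and the updated body tokens
# regress the SMPL-X body parameters.
import torch
import torch.nn as nn

H = W = 256; M = 16; C = 768; B = 27                 # image/patch size, dim, #body tokens
n_patches = (H // M) * (W // M)

patch_embed = nn.Conv2d(3, C, kernel_size=M, stride=M)    # linear projection of patches
pos_embed   = nn.Parameter(torch.zeros(1, n_patches, C))  # position embeddings P_e
body_tokens = nn.Parameter(torch.zeros(1, B, C))          # learnable body tokens T_b

encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=C, nhead=12, dim_feedforward=4 * C,
                               batch_first=True, norm_first=True),
    num_layers=12)

# Body head: 22 axis-angle joint rotations + 10 shape + 3 translation parameters.
body_head = nn.Linear(B * C, 22 * 3 + 10 + 3)

img    = torch.randn(1, 3, H, W)
T_f    = patch_embed(img).flatten(2).transpose(1, 2) + pos_embed  # (1, n_patches, C)
tokens = encoder(torch.cat([body_tokens, T_f], dim=1))            # global feature fusion
T_b_new, T_f_new = tokens[:, :B], tokens[:, B:]                   # updated T_b', T_f'
body_params = body_head(T_b_new.flatten(1))                       # (1, 79) body parameters
```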

3.4 High-Resolution Decoder for Hand and Face

Up-sampling for multi-scale high-resolution features. Since the hands and face in a human image are usually small, previous methods upsample the human image and crop out the hands and face to obtain higher-resolution images. However, this image-level upsampling-crop scheme requires additional backbones to extract the hand and face features separately. To solve this problem, we propose a differentiable feature-level upsampling-crop strategy to enhance the hands and face regression process, as inspired by the recent ViTDet [34]. Specifically, we reshape the feature tokens $\mathbf{T_{f}}^{\prime}$ into a feature map and upsample it into multiple higher-resolution features $\mathbf{T}_{hr}$ via deconvolution layers. Then, since decoding the hand and face component information from the full feature map inevitably leads to redundant computation and makes the computation process inefficient, we perform differentiable RoIAlign [21] on the feature maps and crop out multi-scale hand feature maps $\mathbf{T}_{hand}$ and face feature maps $\mathbf{T}_{face}$, according to the predicted hand and face bounding boxes, which are regressed from $\mathbf{T_{f}}^{\prime}$ using FFNs. The up-sampling and decoding processes for the hand and face components are the same, and we illustrate the case of hand parameter regression in detail in Figure 3(b). The cropped multi-scale hand features can be represented as $\mathbf{T}_{hand}=\{\mathbf{F}_{lr},\ldots,\mathbf{F}_{hr}\}$. The low-resolution feature $\mathbf{F}_{lr}\in\mathbb{R}^{\frac{H^{\prime}}{M}\times\frac{W^{\prime}}{M}\times C}$ is cropped from the original low-resolution feature map, where $H^{\prime}$ and $W^{\prime}$ are the height and width of the hand image patches. $\mathbf{F}_{hr}$ is the highest-resolution feature. The cropped multi-scale features then serve as memory tokens $\mathbf{V}$ for the keypoint-guided component-aware decoder. To relieve the computational pressure, we reduce the token dimension from $C$ to $C^{\prime}$ in the component-aware decoder, where $C^{\prime}=C/2$.
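The sketch below illustrates this feature-level upsample-crop step with deconvolution layers and torchvision's differentiable RoIAlign; the image size, feature resolutions, and the dummy hand bounding box are assumptions for illustration, not the exact OSX implementation.

```python
# Sketch: upsample the encoder feature map with deconvolutions and crop the same
# predicted hand box from every scale with differentiable RoIAlign.
import torch
import torch.nn as nn
from torchvision.ops import roi_align

C, Cp = 768, 384                       # encoder dim C and reduced decoder dim C' = C/2
feat = torch.randn(1, C, 16, 16)       # T_f' reshaped into a 16x16 feature map (patch grid)

up1 = nn.Sequential(nn.ConvTranspose2d(C, Cp, kernel_size=2, stride=2), nn.GELU())
up2 = nn.Sequential(nn.ConvTranspose2d(Cp, Cp, kernel_size=2, stride=2), nn.GELU())
feat_x2 = up1(feat)                    # (1, C', 32, 32)
feat_x4 = up2(feat_x2)                 # (1, C', 64, 64)

# Predicted hand box (batch index, x1, y1, x2, y2) in 256x256 image coordinates (dummy values).
hand_box = torch.tensor([[0., 96., 64., 160., 128.]])

# Crop the hand region from every scale; spatial_scale maps image -> feature coordinates.
T_hand = [
    roi_align(feat,    hand_box, output_size=(8, 8), spatial_scale=16 / 256),  # F_lr
    roi_align(feat_x2, hand_box, output_size=(8, 8), spatial_scale=32 / 256),
    roi_align(feat_x4, hand_box, output_size=(8, 8), spatial_scale=64 / 256),  # F_hr
]
# T_hand plays the role of the multi-scale memory tokens V for the hand decoder.
```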

Figure 4: Illustration of the annotation pipeline of UBody. Black lines show the annotation process of 2D whole-body keypoints, and blue lines show the 3D SMPL-X annotation procedure. Red dotted lines indicate information updates.
Figure 5: Comparisons of (a) the 2D keypoint annotation quality of widely used methods [8, 69] and the recent SOTA [64] on UBody (left part), and (b) the 3D mesh annotation quality of the previous SOTA [40] and ours on COCO (right part).

Keypoint-guided deformable attention decoder. To improve the precision of hand and face parameter regression, we leverage 2D keypoint positions as prior knowledge to obtain better component tokens $\mathbf{T_{c}}$ than random initialization. We simply use the feature map $\mathbf{F}_{lr}$ to regress each 2D keypoint to trade off accuracy and efficiency and regard it as a reference keypoint. The input $\mathbf{T_{c}}\in\mathbb{R}^{K\times C^{\prime}}$ of the decoder, which we call the keypoint-guided component tokens, is obtained by summing up the reference keypoint feature, the pose positional embedding, and learnable embeddings. We then pass the keypoint-guided component tokens through $N$ deformable attention blocks, as inspired by deformable DETR [75]. To avoid attending over all possible spatial locations, these blocks learn a small set of sampling points (e.g., four here) around the reference keypoint and further enlarge the feature spatial resolution while maintaining computational efficiency compared to vanilla DETR [9]. Each block is composed of a multi-head self-attention layer, a multi-scale deformable cross-attention layer, and FFNs. In the deformable cross-attention layer, keypoint queries $\mathbf{Q}$ extract features from the elements of the multi-scale features $\mathbf{V}$ around the keypoint positions $p_{q}$:

$$\mathrm{CA}(\mathbf{Q},\mathbf{V},p_{q})=\sum_{l=1}^{L}\sum_{k=1}^{K}A_{lqk}\,W\,\mathbf{V}_{l}\big(\phi_{l}(p_{q})+\Delta p_{lqk}\big), \qquad (1)$$

where $l$ and $k$ index the feature level and the keys, $A$ and $W$ are the attention weight and a learnable parameter, and $\phi(\cdot)$ and $\Delta p$ are the position rescaling and offset. After that, the updated component tokens $\mathbf{T_{c}}^{\prime}\in\mathbb{R}^{K\times C^{\prime}}$ will be fed into the hand or face regression head to output the final hand or face parameters ($\hat{\mathbf{P}}_{lhand}$, $\hat{\mathbf{P}}_{rhand}$, $\hat{\mathbf{P}}_{face}$), respectively.
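Below is a simplified, single-head sketch of this keypoint-guided deformable cross-attention (Eq. (1)): each query samples a few bilinear locations around its reference keypoint on every feature level and aggregates them with predicted attention weights. Head splitting, offset normalization, and other deformable-DETR details are omitted, so this is only an approximation of the actual layer.

```python
# Simplified multi-scale deformable cross-attention around reference keypoints.
import torch
import torch.nn as nn
import torch.nn.functional as F

Cp, K_pts, n_points = 384, 21, 4              # C', #keypoint queries, sampling points

offset_net = nn.Linear(Cp, n_points * 2)      # predicts offsets Δp_{lqk}
weight_net = nn.Linear(Cp, n_points)          # predicts attention weights A_{lqk}
value_proj = nn.Linear(Cp, Cp)                # learnable value projection W

def deformable_cross_attn(queries, ref_pts, feats):
    """queries: (K, C'); ref_pts: (K, 2) in [0, 1]; feats: list of (C', H_l, W_l)."""
    out = 0
    for feat in feats:                                                  # sum over levels l
        offsets = offset_net(queries).view(K_pts, n_points, 2) * 0.1   # small learned offsets
        weights = weight_net(queries).softmax(-1)                      # (K, n_points)
        loc = (ref_pts.unsqueeze(1) + offsets) * 2 - 1                 # φ_l(p_q) + Δp in [-1, 1]
        sampled = F.grid_sample(feat.unsqueeze(0), loc.unsqueeze(0),
                                align_corners=False)                   # (1, C', K, n_points)
        sampled = value_proj(sampled.squeeze(0).permute(1, 2, 0))      # (K, n_points, C')
        out = out + (weights.unsqueeze(-1) * sampled).sum(dim=1)       # weighted sum over k
    return out                                                         # updated tokens (K, C')

# Example with dummy inputs: two feature levels of the cropped hand features.
tokens = deformable_cross_attn(torch.randn(K_pts, Cp), torch.rand(K_pts, 2),
                               [torch.randn(Cp, 8, 8), torch.randn(Cp, 16, 16)])
```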

Loss Function. OSX is trained in an end-to-end manner by minimizing the following loss function:

$$L = L_{smplx} + L_{kpt3D} + L_{kpt2D} + L_{bbox2D}. \qquad (2)$$

The four items are calculated as the L1 distance between the ground-truth values and the predicted ones. Specifically, $L_{smplx}$ provides the explicit supervision of the SMPL-X parameters. $L_{kpt3D}$, $L_{kpt2D}$, and $L_{bbox2D}$ are regression losses for the 3D whole-body keypoints, the projected 2D whole-body keypoints, and the left/right hand and face 2D bounding boxes. More details are provided in the Appendix.
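Assuming the predictions and ground truths are available as tensors, the total loss in Eq. (2) can be computed as in the sketch below; any term weighting is omitted because the equation is written as a plain sum, and the dictionary keys are hypothetical.

```python
# Sketch of the overall training loss: four L1 terms summed as in Eq. (2).
import torch.nn.functional as F

def osx_loss(pred, gt):
    """pred/gt: dicts holding SMPL-X params, 3D/2D keypoints, and hand/face boxes."""
    l_smplx  = F.l1_loss(pred["smplx"],  gt["smplx"])    # SMPL-X parameter supervision
    l_kpt3d  = F.l1_loss(pred["kpt3d"],  gt["kpt3d"])    # 3D whole-body keypoints
    l_kpt2d  = F.l1_loss(pred["kpt2d"],  gt["kpt2d"])    # projected 2D whole-body keypoints
    l_bbox2d = F.l1_loss(pred["bbox2d"], gt["bbox2d"])   # hand/face 2D bounding boxes
    return l_smplx + l_kpt3d + l_kpt2d + l_bbox2d
```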

4 UBody–An Upper Body Dataset

3D whole-body mesh recovery from videos is a basic computer vision task, as it can provide comprehensive motion, gesture, and expression information to understand how humans perceive and act. However, existing datasets lack scenes from downstream tasks, such as sign language recognition, gesture generation, emotion recognition, and real-life scenarios recorded as VLOGs, making it hard for recent state-of-the-art methods to generalize well to these scenes. Interestingly, these scenarios are more concerned with the representations of upper bodies. We take this insight and present a novel large-scale benchmark for expressive upper-body mesh recovery, as shown in Figure 2(f) to (t), named UBody. Our annotation pipeline is shown in Figure 4. Due to the page limit, we put the data collection, data annotation processes, and annotation visualization in the Appendix.

4.1 Quality Analysis

Our annotation pipeline produces far better 3D pseudo-GT fits with a shorter running time than previous optimization-based and learning-based methods [26, 16, 49, 40, 50]. Figure 5(a) compares our 2D annotation results with two widely used annotation methods (OpenPose [8] and MediaPipe [69]). The quality of our 2D annotations is much more accurate, especially in terms of hand details and robustness to occlusion and blur. Figure 5(b) compares our 3D annotations with the SOTA NeuralAnnot method [40] on COCO. To the naked eye, the quality of our fits is also better in terms of body shape and whole-body pose.

4.2 Data Characteristics

Compared to the popular datasets illustrated in Figure 2 (a) to (e) and the related human-centric datasets listed in Table 1, UBody possesses unique features that present new challenges for future research. Many videos are from edited media with highly diverse scenes and rich human actions and gestures. They have abrupt shot changes and dynamic camera viewpoints, leading to discontinuities between frames. Close-up shots of humans cause severe truncation, on which existing methods tend to fail. Meanwhile, there are varying degrees of interaction with objects and between body components, as well as subtitles and special effects that act as occlusions. Also, there are large variations in background and lighting. Those conditions have not appeared in previous datasets. All scenes in UBody have rich hand gestures and facial expressions, making recognition models pay more attention to these important body components. Lastly, all of these real-life videos provide audio as additional information to serve future multi-modality methods. We also provide statistical comparisons between the key features of UBody and the widely used dataset AGORA [47] in Figure 6. AGORA's hand/face bounding box areas are generally small, while UBody pays more attention to diverse hand and face scales, as evidenced by its more dispersed area distribution. Meanwhile, UBody has more visible face/hand keypoints, underscoring the importance of recognizing hand gestures and facial expressions. Finally, UBody's inclusion of real-life videos provides new possibilities for subsequent spatio-temporal modeling that are not available in AGORA, which is an image-based dataset.

Figure 6: The statistical comparisons of the areas of the hand and face bounding boxes (upper row) and the number of 2D visible hand and face keypoints (lower row) with the logarithmic scale of the Y-axis. UBody focuses on upper bodies exhibiting expressive gestures and facial expressions.

5 Experiment

5.1 Experimental Setup

Due to the page limit, we leave the detailed experiment setup, implementation, annotation visualization, qualitative comparison with SOTA methods, and more benchmark results and analyses in the appendix.

Datasets. We use COCO-Wholebody [24], MPII [3], and Human3.6M [23] as the training set. Unlike previous multi-stage methods [48, 39], we do not use additional hand-only and face-only datasets for training, keeping OSX a simple baseline for one-stage methods. The SMPL/SMPL-X pseudo-GTs are obtained from EFT [25] and NeuralAnnot [40].

Evaluation metrics. For 3D whole-body mesh recovery, we utilize the mean per-vertex position error (MPVPE) as our primary metric. In addition, we apply Procrustes Analysis (PA) to the recovered mesh and report the PA-MPVPE after rigid alignment. For AGORA, we also report the normalized mean vertex error (N-MPVPE) to compensate for missed detections. Hand error is calculated as the mean of the left and right hands. For 3D body-only recovery on 3DPW, we follow previous works [67, 32] to report the mean per-joint position error (MPJPE) and PA-MPJPE. All reported errors are in units of millimeters.
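For reference, a small sketch of these metrics is given below, assuming the predicted and ground-truth meshes are (N, 3) vertex arrays in millimeters; the Procrustes alignment shown is the standard similarity-transform alignment and may differ in minor details from the official evaluation code.

```python
# Sketch: MPVPE is the mean per-vertex error; PA-MPVPE aligns the prediction to the
# ground truth with a similarity transform (scale, rotation, translation) first.
import numpy as np

def mpvpe(pred, gt):
    return np.linalg.norm(pred - gt, axis=-1).mean()

def procrustes_align(pred, gt):
    """Align pred to gt with an optimal similarity transform (Kabsch + scale)."""
    mu_p, mu_g = pred.mean(0), gt.mean(0)
    p, g = pred - mu_p, gt - mu_g
    U, s, Vt = np.linalg.svd(g.T @ p)          # optimal rotation via SVD
    R = U @ Vt
    if np.linalg.det(R) < 0:                   # avoid reflections
        U[:, -1] *= -1
        R = U @ Vt
    scale = np.trace(R @ p.T @ g) / (p ** 2).sum()
    return scale * p @ R.T + mu_g

def pa_mpvpe(pred, gt):
    return mpvpe(procrustes_align(pred, gt), gt)
```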

Implementation details. OSX is implemented in PyTorch and trained using the Adam optimizer with an initial learning rate of $1\times 10^{-4}$ for 14 epochs. Scaling, rotation, random horizontal flip, and color jittering are used as data augmentations during training. We set the number of body tokens $\mathbf{T}_b$ and component tokens $\mathbf{T}_c$ to 27 and 92, respectively.

| Method | AGORA-test MPVPE↓ (All / Hands / Face) | AGORA-test N-MPVPE↓ (All / Body) | EHF MPVPE↓ (All / Hands / Face) | EHF PA-MPVPE↓ (All / Hands / Face) | 3DPW MPJPE↓ / PA-MPJPE↓ (Body) |
| ExPose [48] | 217.3 / 73.1 / 51.1 | 265.0 / 184.8 | 77.1 / 51.6 / 35.0 | 54.5 / 12.8 / 5.8 | 93.4 / 60.7 |
| FrankMocap [56] | - / 55.2 / - | - / 207.8 | 107.6 / 42.8 / - | 57.5 / 12.6 / - | 96.7 / 61.9 |
| PIXIE [16] | 191.8 / 49.3 / 50.2 | 233.9 / 173.4 | 89.2 / 42.8 / 32.7 | 55.0 / 11.1 / 4.6 | 91.0 / 61.3 |
| Hand4Whole [39] | - / - / - | - / - | 79.2 / 43.2 / 25.0 | 53.1 / 12.1 / 5.8 | - / - |
| Hand4Whole [39]× | 135.5 / 47.2 / 41.6 | 144.1 / 96.0 | 76.8 / 39.8 / 26.1 | 50.3 / 10.8 / 5.8 | 86.6 / 54.4 |
| OSX (Ours) | 122.8 (↓9.5%) / 45.7 / 36.2 | 130.6 / 85.3 | 70.8 (↓7.8%) / 53.7 / 26.4 | 48.7 / 15.9 / 6.0 | 74.7 (↓13.4%) / 45.1 |
Table 3: 3D body reconstruction error comparisons on three existing datasets. × uses additional hand-only and face-only training datasets.
| Hand | Ours | w/o H.D. | w/o K.G | w/o both |
| MPVPE | 53.7 | 55.3 | 55.1 | 56.4 |
| PA-MPVPE | 15.9 | 17.7 | 17.6 | 18.1 |

| Face | Ours | w/o F.D. | w/o K.G | w/o both |
| MPVPE | 26.4 | 27.2 | 26.4 | 26.8 |
| PA-MPVPE | 6.0 | 5.9 | 5.8 | 6.0 |

| Upsampling | ×1 | ×2 | ×4 | ×8 |
| MPVPE | 54.9 | 54.3 | 53.7 | 54.1 |
Table 4: Ablation study of component-aware decoder on EHF with H.D., F.D., K.G, and upsampling strategies. H.D., F.D., and K.G are abbreviations for Hand Decoder, Face Decoder and Keypoint-Guided scheme.
| Method | MPVPE↓ All | Hand | Face | PA-MPVPE↓ All | Hand | Face | PA-MPJPE↓ Body | Hand |
| ExPose [48] | 171.5 | 83.7 | 45.1 | 66.9 | 12.0 | 3.9 | 70.7 | 12.3 |
| PIXIE [16] | 168.4 | 55.6 | 45.2 | 61.7 | 12.2 | 4.2 | 66.8 | 12.3 |
| Hand4Whole [39] | 104.1 | 45.7 | 27.0 | 44.8 | 8.9 | 2.8 | 45.5 | 9.0 |
| Hand4Whole [39]× | 157.4 | 62.2 | 49.8 | 82.2 | 9.8 | 3.9 | 92.8 | 10.0 |
| OSX (Ours) | 92.4 | 47.7 | 24.9 | 42.4 | 10.8 | 2.4 | 42.9 | 11.0 |
| OSX (Ours)† | 81.9 | 41.5 | 21.2 | 42.2 | 8.6 | 2.0 | 48.4 | 8.8 |
Table 5: Reconstruction errors on the UBody test set under the intra-scene protocol. All models are pretrained on previous datasets, except for the results labeled by (i) †: finetuned on the UBody training data; (ii) ×: finetuned on the AGORA training data. The result of the inter-scene setting is in the appendix.

5.2 Comparisons with Existing Methods

Table 3 provides a comprehensive comparison of OSX and existing whole-body mesh recovery methods. As the first one-stage method, OSX surpasses existing multi-stage models with complex designs in most cases. Notably, OSX has not been trained on hand-only and face-only datasets [76, 29, 41]. Our All MPVPEs show a 9.5% improvement on the AGORA test set and a 7.8% improvement on EHF over the SOTA [39]. Since AGORA is a more complex and natural dataset than EHF, previous works [39, 47] claim it is more convincing and representative of real-world scenarios. We also visualize the misleading high-error cases on EHF in Figure 7. Besides, we obtain SOTA performance on the body-only dataset 3DPW, with a 13.4% error reduction compared to these whole-body methods. More qualitative results are available in the appendix.

Figure 7: Illustration of the inconsistency between quantitative and qualitative results, comparing Hand4Whole [39] (middle) with OSX (right) on EHF.

5.3 Ablation Study

Impact of the component-aware decoder. Unlike body-only pose estimation, whole-body mesh recovery requires attention to both the body's posture, which is on a larger spatial scale, and the gestures and expressions of the hands and face, which are on a finer scale. To handle the resolution issue in a one-stage pipeline, we propose the component-aware decoder attached to the component-aware encoder. First, in the upper part of Table 4, we verify the effectiveness of the proposed decoder for both hand and face regression. We observe a significant drop without the decoder (e.g., w/o H.D. and w/o F.D.), indicating that simply regressing the low-resolution hands and face directly from the encoder is inferior. Moreover, the errors also increase without the proposed keypoint-guided deformable attention scheme, as shown in the middle part of Table 4. In particular, the performance of hand estimation is highly influenced, showing that hand pose estimation attends more to the sparsely deformable spatial information to obtain better queries.

Impact of the up-sampling strategy. To relieve the low-resolution problem of hand and facial features, we design the feature up-sampling strategy in the decoder to obtain multi-scale higher-resolution features. The lower part of Table 4 presents the impact of different up-sampling scales. As the up-sampling scale increases, the MPVPE decreases and then reaches a saturation point. Therefore, we use three scales (i.e., $[\times 1, \times 2, \times 4]$) by default in our experiments.

5.4 Benchmark on UBody

As a new dataset, we provide both quantitative and qualitative results on UBody. Table 5 presents the performance comparisons of existing 3D whole-body methods. The general result ranking is similar to AGORA. Since the upper body is closer to the camera, the errors are smaller than on AGORA. However, the hands and face play a more important role than in previous data. Besides, we finetune Hand4Whole on AGORA and test it again, and we find that all errors are significantly enlarged. This observation can be attributed to the data distribution gap between AGORA and UBody, as shown in Figure 6. Moreover, we train OSX on our training set and find a 16.1% improvement compared to the original pretrained model, indicating that UBody can serve to improve the performance on downstream real-life scenes.

6 Conclusion

In this work, we propose the first one-stage pipeline for 3D whole-body mesh recovery that achieves SOTA performance on three benchmarks in a simple yet effective manner. Moreover, to bridge the gap between the basic task of full-body pose and shape estimation and their downstream tasks, we develop a large-scale dataset with comprehensive scenes covering our daily life. With our proposed annotation method, we show that training on UBody can effectively improve the performance of mesh recovery in upper-body scenes. We hope this work can contribute new insights to this area, both in terms of methodology and dataset.

Limitation and future work. Currently, our training does not use additional hand- and face-specific datasets. It is worth studying how to make the best use of them in our pipeline to further improve performance. Also, we can validate the effectiveness of UBody on some downstream applications, e.g., gesture recognition and avatar driving.

Acknowledgements: This work was partially funded through the National Key Research and Development Program of China (Project No.2022YFB36066), in part by the Shenzhen Science and Technology Project under Grant (CJGJZD20200617102601004, JCYJ20220818101001004).

References

  • [1] Easymocap - make human motion capture easier. Github, 2021.
  • [2] Oswald Aldrian and William A. P. Smith. Inverse rendering of faces with a 3d morphable model. TPAMI, 2013.
  • [3] Mykhaylo Andriluka, Leonid Pishchulin, Peter Gehler, and Bernt Schiele. 2d human pose estimation: New benchmark and state of the art analysis. In CVPR, 2014.
  • [4] Adnane Boukhayma, Rodrigo de Bem, and Philip H. S. Torr. 3d hand shape and pose from images in the wild. In CVPR, 2019.
  • [5] Andrew Brown, Vicky Kalogeiton, and Andrew Zisserman. Face, body, voice: Video person-clustering with multiple modalities. In ICCV, 2021.
  • [6] Zhongang Cai, Daxuan Ren, Ailing Zeng, Zhengyu Lin, Tao Yu, Wenjia Wang, Xiangyu Fan, Yang Gao, Yifan Yu, Liang Pan, Fangzhou Hong, Mingyuan Zhang, Chen Change Loy, Lei Yang, and Ziwei Liu. Humman: Multi-modal 4d human dataset for versatile sensing and modeling. In ECCV, 2022.
  • [7] Necati Cihan Camgöz, Ben Saunders, Guillaume Rochette, Marco Giovanelli, Giacomo Inches, Robin Nachtrab-Ribback, and Richard Bowden. Content4all open research sign language translation datasets. In FG. IEEE, 2021.
  • [8] Zhe Cao, Gines Hidalgo, Tomas Simon, Shih-En Wei, and Yaser Sheikh. OpenPose: Realtime multi-person 2d pose estimation using part affinity fields. TPAMI, 2021.
  • [9] Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. End-to-end object detection with transformers. In ECCV, 2020.
  • [10] Theocharis Chatzis, Andreas Stergioulas, Dimitrios Konstantinidis, Kosmas Dimitropoulos, and Petros Daras. A comprehensive study on deep learning-based 3d hand pose estimation methods. Applied Sciences, 2020.
  • [11] Hongsuk Choi, Gyeongsik Moon, and Kyoung Mu Lee. Pose2mesh: Graph convolutional network for 3d human pose and mesh recovery from a 2d human pose. In ECCV. 2020.
  • [12] Yu Deng, Jiaolong Yang, Sicheng Xu, Dong Chen, Yunde Jia, and Xin Tong. Accurate 3d face reconstruction with weakly-supervised learning: From single image to image set. In CVPRW, 2019.
  • [13] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. In ICLR, 2021.
  • [14] Amanda Duarte, Shruti Palaskar, Lucas Ventura, Deepti Ghadiyaram, Kenneth DeHaan, Florian Metze, Jordi Torres, and Xavier Giro-i Nieto. How2sign: A large-scale multimodal dataset for continuous american sign language. In CVPR, 2021.
  • [15] Bernhard Egger, William A. P. Smith, Ayush Tewari, Stefanie Wuhrer, Michael Zollhoefer, Thabo Beeler, Florian Bernard, Timo Bolkart, Adam Kortylewski, Sami Romdhani, Christian Theobalt, Volker Blanz, and Thomas Vetter. 3d morphable face models—past, present, and future. TOG, 2020.
  • [16] Yao Feng, Vasileios Choutas, Timo Bolkart, Dimitrios Tzionas, and Michael J. Black. Collaborative regression of expressive bodies using moderation. In 3DV, 2021.
  • [17] David F Fouhey, Wei-cheng Kuo, Alexei A Efros, and Jitendra Malik. From lifestyle vlogs to everyday interactions. In CVPR, 2018.
  • [18] Chunhui Gu, Chen Sun, David A. Ross, Carl Vondrick, Caroline Pantofaru, Yeqing Li, Sudheendra Vijayanarasimhan, George Toderici, Susanna Ricco, Rahul Sukthankar, Cordelia Schmid, and Jitendra Malik. AVA: A video dataset of spatio-temporally localized atomic visual actions. In CVPR, 2018.
  • [19] Lin Guo, Zongxing Lu, and Ligang Yao. Human-machine interaction sensing technology based on hand gesture recognition: A review. IEEE Transactions on Human-Machine Systems, 2021.
  • [20] Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, and Ross Girshick. Masked autoencoders are scalable vision learners. In CVPR, 2022.
  • [21] Kaiming He, Georgia Gkioxari, Piotr Dollar, and Ross Girshick. Mask R-CNN. In ICCV, 2017.
    何开明、Georgia Gkioxari、Piotr Dollar 和 Ross Girshick。掩码 R-CNN。In ICCV, 2017.
  • [22] Lin Huang, Zhang Boshen, Zhilin Guo, Yang Xiao, Zhiguo Cao, and Junsong Yuan. Survey on depth and rgb image-based 3d hand shape and pose estimation. Virtual Reality & Intelligent Hardware, 2021.
    黄林、张伯申、郭志林、肖阳、曹志国和袁俊松。基于深度和RGB图像的3D手部形状和姿态估计研究。虚拟现实与智能硬件,2021年。
  • [23] Catalin Ionescu, Dragos Papava, Vlad Olaru, and Cristian Sminchisescu. Human3.6M: Large scale datasets and predictive methods for 3d human sensing in natural environments. TPAMI, 2014.
    Catalin Ionescu、Dragos Papava、Vlad Olaru 和 Cristian Sminchisescu。Human3.6M:用于自然环境中 3d 人体感应的大规模数据集和预测方法。TPAMI,2014。
  • [24] Sheng Jin, Lumin Xu, Jin Xu, Can Wang, Wentao Liu, Chen Qian, Wanli Ouyang, and Ping Luo. Whole-body human pose estimation in the wild. In ECCV, 2020.
    金晟、徐璐敏、徐瑾、王灿、刘文涛、钱晨、欧阳万里、罗平。野外全身人体姿态估计。在 ECCV,2020。
  • [25] Hanbyul Joo, Natalia Neverova, and Andrea Vedaldi. Exemplar fine-tuning for 3d human pose fitting towards in-the-wild 3d human pose estimation. 2020.
    Hanbyul Joo, Natalia Neverova, and Andrea Vedaldi.面向野外三维人体姿态估计的三维人体姿态拟合范例微调。2020.
  • [26] Hanbyul Joo, Natalia Neverova, and Andrea Vedaldi. Exemplar fine-tuning for 3d human model fitting towards in-the-wild 3d human pose estimation. 3DV, 2022.
    Hanbyul Joo, Natalia Neverova, and Andrea Vedaldi.面向野外三维人体姿态估计的三维人体模型拟合范例微调。3DV,2022年。
  • [27] Hamid Reza Vaezi Joze and Oscar Koller. Ms-asl: A large-scale data set and benchmark for understanding american sign language. ArXiv, abs/1812.01053, 2019.
    Hamid Reza Vaezi Joze 和 Oscar Koller.Ms-asl:理解美国手语的大规模数据集和基准。ArXiv,abs/1812.01053,2019。
  • [28] Angjoo Kanazawa, Michael J. Black, David W. Jacobs, and Jitendra Malik. End-to-end recovery of human shape and pose. In CVPR, 2018.
    Angjoo Kanazawa、Michael J. Black、David W. Jacobs 和 Jitendra Malik。端到端恢复人体形状和姿势。In CVPR, 2018.
  • [29] Tero Karras, Samuli Laine, and Timo Aila. A style-based generator architecture for generative adversarial networks. TPAMI, 2021.
    Tero Karras、Samuli Laine 和 Timo Aila.基于风格的生成式对抗网络生成器架构。TPAMI,2021年。
  • [30] Muhammed Kocabas, Chun-Hao P Huang, Otmar Hilliges, and Michael J Black. Pare: Part attention regressor for 3d human body estimation. In ICCV, 2021.
    Muhammed Kocabas、Chun-Hao P Huang、Otmar Hilliges 和 Michael J Black。帕雷:用于 3D 人体估算的部分注意力回归器。2021 年,ICCV。
  • [31] Muhammed Kocabas, Chun-Hao P Huang, Joachim Tesch, Lea Müller, Otmar Hilliges, and Michael J Black. Spec: Seeing people in the wild with an estimated camera. In CVPR, 2021.
    Muhammed Kocabas、Chun-Hao P Huang、Joachim Tesch、Lea Müller、Otmar Hilliges 和 Michael J Black。Spec:用估计相机看野外的人。CVPR,2021年。
  • [32] Nikos Kolotouros, Georgios Pavlakos, Michael Black, and Kostas Daniilidis. Learning to reconstruct 3d human pose and shape via model-fitting in the loop. In ICCV, 2019.
    Nikos Kolotouros、Georgios Pavlakos、Michael Black 和 Kostas Daniilidis。通过循环中的模型拟合学习重建 3d 人体姿势和形状。2019年,ICCV。
  • [33] Zijian Kuang and Xinran Tie. Flow-based video segmentation for human head and shoulders. arXiv preprint arXiv:2104.09752, 2021.
    匡子健、铁欣然。基于流的人体头部和肩部视频分割》,arXiv preprint arXiv:2104.09752, 2021.
  • [34] Yanghao Li, Hanzi Mao, Ross B. Girshick, and Kaiming He. Exploring plain vision transformer backbones for object detection. In ECCV, 2022.
    李阳昊、毛汉子、罗斯-B-吉尔希克、何开明。探索用于物体检测的平视变换器骨干。在2022年的ECCV大会上
  • [35] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In ECCV, 2014.
    Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick.微软 coco:上下文中的通用对象。2014年欧洲计算机大会。
  • [36] Xin Liu, Henglin Shi, Haoyu Chen, Zitong Yu, Xiaobai Li, and Guoying Zhao. imigue: An identity-free video dataset for micro-gesture understanding and emotion analysis. In CVPR, 2021.
    刘昕、史恒林、陈皓宇、于子彤、李小白和赵国英:用于微姿态理解和情感分析的无身份视频数据集。In CVPR, 2021.
  • [37] Dushyant Mehta, Helge Rhodin, Dan Casas, Pascal Fua, Oleksandr Sotnychenko, Weipeng Xu, and Christian Theobalt. Monocular 3d human pose estimation in the wild using improved cnn supervision. In 3DV, year=2017.
    Dushyant Mehta、Helge Rhodin、Dan Casas、Pascal Fua、Oleksandr Sotnychenko、Weipeng Xu 和 Christian Theobalt。使用改进的 cnn 监督进行野外单目 3D 人体姿态估计。3DV, year=2017.
  • [38] Sushmita Mitra and Tinku Acharya. Gesture recognition: A survey. IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews), 2007.
    Sushmita Mitra 和 Tinku Acharya.手势识别:调查。IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews), 2007.
  • [39] Gyeongsik Moon, Hongsuk Choi, and Kyoung Mu Lee. Accurate 3d hand pose estimation for whole-body 3d human mesh estimation. In CVPRW, 2022.
    Gyeongsik Moon, Hongsuk Choi, and Kyoung Mu Lee.用于全身三维人体网格估算的精确三维手部姿势估算。CVPRW,2022年。
  • [40] Gyeongsik Moon, Hongsuk Choi, and Kyoung Mu Lee. NeuralAnnot: Neural annotator for 3d human mesh training sets. In CVPRW, 2022.
    Gyeongsik Moon、Hongsuk Choi 和 Kyoung Mu Lee。NeuralAnnot:用于 3D 人体网格训练集的神经注释器。2022 年,CVPRW。
  • [41] Gyeongsik Moon, Shoou-I Yu, He Wen, Takaaki Shiratori, and Kyoung Mu Lee. Interhand2.6m: A dataset and baseline for 3d interacting hand pose estimation from a single rgb image. ECCV, 2020.
    Gyeongsik Moon、Shoou-I Yu、He Wen、Takaaki Shiratori 和 Kyoung Mu Lee。Interhand2.6m:从单张RGB图像估算3D交互手姿势的数据集和基线。ECCV,2020。
  • [42] Lea Muller, Ahmed AA Osman, Siyu Tang, Chun-Hao P Huang, and Michael J Black. On self-contact and human pose. In CVPR, 2021.
    Lea Muller、Ahmed AA Osman、Siyu Tang、Chun-Hao P Huang 和 Michael J Black。关于自我接触和人类姿势。CVPR,2021年。
  • [43] Lea Muller, Ahmed A. A. Osman, Siyu Tang, Chun-Hao P. Huang, and Michael J. Black. On self-contact and human pose. In CVPR, 2021.
    Lea Muller、Ahmed A. A. Osman、Siyu Tang、Chun-Hao P. Huang 和 Michael J. Black。关于自我接触和人类姿势。In CVPR, 2021.
  • [44] Arsha Nagrani, Joon Son Chung, Weidi Xie, and Andrew Zisserman. Voxceleb: Large-scale speaker verification in the wild. Computer Speech & Language, 2020.
    Arsha Nagrani、Joon Son Chung、Weidi Xie 和 Andrew Zisserman。Voxceleb:野生大规模说话者验证。计算机语音与语言》,2020 年。
  • [45] Supreeth Narasimhaswamy, Thanh Nguyen, Mingzhen Huang, and Minh Hoai. Whose hands are these? hand detection and hand-body association in the wild. In CVPR, 2022.
    Supreeth Narasimhaswamy、Thanh Nguyen、Mingzhen Huang 和 Minh Hoai。这是谁的手?野外手部检测与手体关联。CVPR,2022 年。
  • [46] Hui En Pang, Zhongang Cai, Lei Yang, Tianwei Zhang, and Ziwei Liu. Benchmarking and analyzing 3d human pose and shape estimation beyond algorithms. In Neural Information Processing Systems Datasets and Benchmarks Track.
    Hui En Pang, Zhongang Cai, Lei Yang, Tianwei Zhang, and Ziwei Liu.超越算法的三维人体姿态和形状估计基准测试与分析。神经信息处理系统数据集和基准研究》。
  • [47] Priyanka Patel, Chun-Hao P. Huang, Joachim Tesch, David T. Hoffmann, Shashank Tripathi, and Michael J. Black. AGORA: Avatars in geography optimized for regression analysis. In CVPR, 2021.
    Priyanka Patel、Chun-Hao P. Huang、Joachim Tesch、David T. Hoffmann、Shashank Tripathi 和 Michael J. Black。AGORA:为回归分析优化的地理学头像。In CVPR, 2021.
  • [48] Georgios Pavlakos, Vasileios Choutas, Timo Bolkart, Dimitrios Tzionas, Michael J. Black, Vasileios Choutas, Georgios Pavlakos, Timo Bolkart, Dimitrios Tzionas, and Michael J. Black. Monocular expressive body regression through body-driven attention. ECCV, 2020.
    Georgios Pavlakos, Vasileios Choutas, Timo Bolkart, Dimitrios Tzionas, Michael J. Black, Vasileios Choutas, Georgios Pavlakos, Timo Bolkart, Dimitrios Tzionas, and Michael J. Black.通过身体驱动注意力实现单目表现性身体回归。ECCV, 2020.
  • [49] Georgios Pavlakos, Vasileios Choutas, Nima Ghorbani, Timo Bolkart, Ahmed A. Osman, Dimitrios Tzionas, and Michael J. Black. Expressive body capture: 3d hands, face, and body from a single image. In CVPR, 2019.
    Georgios Pavlakos、Vasileios Choutas、Nima Ghorbani、Timo Bolkart、Ahmed A. Osman、Dimitrios Tzionas 和 Michael J. Black。富有表现力的肢体捕捉:从单张图像中捕捉 3d 手、脸和身体。In CVPR, 2019.
  • [50] Georgios Pavlakos, Jitendra Malik, and Angjoo Kanazawa. Human mesh recovery from multiple shots. In CVPR, 2022.
    Georgios Pavlakos、Jitendra Malik 和 Angjoo Kanazawa。从多个镜头恢复人体网格2022 年,CVPR。
  • [51] Sida Peng, Yuanqing Zhang, Yinghao Xu, Qianqian Wang, Qing Shuai, Hujun Bao, and Xiaowei Zhou. Neural body: Implicit neural representations with structured latent codes for novel view synthesis of dynamic humans. In CVPR, 2021.
    彭思达、张元庆、徐英豪、王倩倩、帅青、鲍虎军、周晓伟。神经体:具有结构化潜码的隐式神经表征,用于动态人体的新颖视图合成。在2021年CVPR大会上。
  • [52] William H Press and Saul A Teukolsky. Savitzky-golay smoothing filters. Computers in Physics, 1990.
    William H Press 和 Saul A Teukolsky.萨维茨基-戈莱平滑滤波器》。物理学中的计算机》,1990 年
  • [53] Jie Pu, Yannis Panagakis, Stavros Petridis, and Maja Pantic. Audio-visual object localization and separation using low-rank and sparsity. In ICASSP. IEEE, 2017.
    Jie Pu, Yannis Panagakis, Stavros Petridis, and Maja Pantic.利用低等级和稀疏性进行视听对象定位和分离。In ICASSP.IEEE,2017。
  • [54] Razieh Rastgoo, Kourosh Kiani, Sergio Escalera, and Mohammad Sabokrou. Sign language production: a review. In CVPR, 2021.
    Razieh Rastgoo、Kourosh Kiani、Sergio Escalera 和 Mohammad Sabokrou。手语制作:综述。CVPR,2021。
  • [55] Chris Rockwell and David F Fouhey. Full-body awareness from partial observations. In ECCV, 2020.
    克里斯-罗克韦尔和大卫-福黑。通过部分观测实现全身感知。ECCV, 2020.
  • [56] Yu Rong, Takaaki Shiratori, and Hanbyul Joo. FrankMocap: A monocular 3d whole-body pose estimation system via regression and integration. In ICCVW, 2021.
    Yu Rong、Takaaki Shiratori 和 Hanbyul Joo。FrankMocap:通过回归和整合实现的单目 3D 全身姿态估计系统。In ICCVW, 2021.
  • [57] Aliaksandr Siarohin, Oliver J Woodford, Jian Ren, Menglei Chai, and Sergey Tulyakov. Motion representations for articulated animation. In CVPR, 2021.
    Aliaksandr Siarohin, Oliver J Woodford, Jian Ren, Menglei Chai, and Sergey Tulyakov.关节动画的运动表示。CVPR,2021 年。
  • [58] S Subburaj and S Murugavalli. Survey on sign language recognition in context of vision-based and deep learning. Measurement: Sensors, 2022.
    S Subburaj 和 S Murugavalli.基于视觉和深度学习的手语识别调查。测量:传感器,2022。
  • [59] Yu Sun, Tianyu Huang, Qian Bao, Wu Liu, Wenpeng Gao, and Yili Fu. Learning monocular mesh recovery of multiple body parts via synthesis. In ICASSP. IEEE, 2022.
    孙瑜、黄天宇、鲍倩、刘武、高文鹏、付一力。通过合成学习多个身体部位的单目网格复原。In ICASSP.IEEE, 2022.
  • [60] Ayush Tewari, Michael Zollhofer, Hyeongwoo Kim, Pablo Garrido, Florian Bernard, Patrick Perez, and Christian Theobalt. MoFA: Model-based deep convolutional face autoencoder for unsupervised monocular reconstruction. In ICCVW, 2017.
    Ayush Tewari、Michael Zollhofer、Hyeongwoo Kim、Pablo Garrido、Florian Bernard、Patrick Perez 和 Christian Theobalt。MoFA:用于无监督单目重建的基于模型的深度卷积人脸自动编码器。In ICCVW, 2017.
  • [61] Yating Tian, Hongwen Zhang, Yebin Liu, and Limin Wang. Recovering 3d human mesh from monocular images: A survey. arXiv:2203.01923, 2022.
    Yating Tian、Hongwen Zhang、Yebin Liu 和 Limin Wang.从单目图像中恢复三维人体网格:arXiv:2203.01923, 2022.
  • [62] Timo von Marcard, Roberto Henschel, Michael J. Black, Bodo Rosenhahn, and Gerard Pons-Moll. Recovering accurate 3d human pose in the wild using IMUs and a moving camera. In ECCV. 2018.
    Timo von Marcard、Roberto Henschel、Michael J. Black、Bodo Rosenhahn 和 Gerard Pons-Moll。使用 IMUs 和移动摄像机在野外恢复精确的 3D 人体姿态。In ECCV.2018.
  • [63] Donglai Xiang, Hanbyul Joo, and Yaser Sheikh. Monocular total capture: Posing face, body, and hands in the wild. In CVPR, 2019.
    Donglai Xiang, Hanbyul Joo, and Yaser Sheikh.单目全面捕捉:在野外摆出脸部、身体和手。In CVPR, 2019.
  • [64] Yufei Xu, Jing Zhang, Qiming Zhang, and Dacheng Tao. Vitpose: Simple vision transformer baselines for human pose estimation. arXiv:2204.12484, 2022.
    Yufei Xu, Jing Zhang, Qiming Zhang, and Dacheng Tao.Vitpose:用于人体姿态估计的简单视觉变换基线。arXiv:2204.12484,2022。
  • [65] Youngwoo Yoon, Woo-Ri Ko, Minsu Jang, Jaeyeon Lee, Jaehong Kim, and Geehyuk Lee. Robots learn social skills: End-to-end learning of co-speech gesture generation for humanoid robots. In ICRA, 2019.
    Youngwoo Yoon, Woo-Ri Ko, Minsu Jang, Jaeyeon Lee, Jaehong Kim, and Geehyuk Lee.机器人学习社交技能:仿人机器人端到端协同语音手势生成学习。In ICRA, 2019.
  • [66] Ian T Young and Lucas J Van Vliet. Recursive implementation of the gaussian filter. Signal processing, 1995.
    Ian T Young 和 Lucas J Van Vliet.高斯滤波器的递归实现。信号处理,1995 年。
  • [67] Ailing Zeng, Xuan Ju, Lei Yang, Ruiyuan Gao, Xizhou Zhu, Bo Dai, and Qiang Xu. Deciwatch: A simple baseline for 10x efficient 2d and 3d pose estimation. In ECCV, 2022.
    曾爱玲、鞠煊、杨磊、高瑞媛、朱西洲、戴波、徐强。Deciwatch:用于 10 倍高效 2d 和 3d 姿势估计的简单基线。2022 年 ECCV。
  • [68] Ailing Zeng, Lei Yang, Xuan Ju, Jiefeng Li, Jianyi Wang, and Qiang Xu. Smoothnet: A plug-and-play network for refining human poses in videos. In ECCV, 2022.
    曾爱玲、杨磊、鞠煊、李杰峰、王建一和徐强。Smoothnet:用于完善视频中人体姿势的即插即用网络。在 2022 年的 ECCV 会议上。
  • [69] Fan Zhang, Valentin Bazarevsky, Andrey Vakunov, Andrei Tkachenka, George Sung, Chuo-Ling Chang, and Matthias Grundmann. Mediapipe hands: On-device real-time hand tracking. arXiv:2006.10214, 2020.
    张帆、Valentin Bazarevsky、Andrey Vakunov、Andrei Tkachenka、George Sung、Chuo-Ling Chang 和 Matthias Grundmann。Mediapipe hands:ArXiv:2006.10214, 2020.
  • [70] Hongwen Zhang, Yating Tian, Yuxiang Zhang, Mengcheng Li, Liang An, and Yebin Liu. Pymaf-x: Towards well-aligned full-body model regression from monocular images. arXiv:2207.06400, 2022.
    Hongwen Zhang, Yating Tian, Yuxiang Zhang, Mengcheng Li, Liang An, and Yebin Liu.Pymaf-x:Pymaf-x: Towards well-aligned full-body model regression from monocular images. ArXiv:2207.06400, 2022.
  • [71] Jason Y Zhang, Panna Felsen, Angjoo Kanazawa, and Jitendra Malik. Predicting 3d human dynamics from video. In ICCV, 2019.
    Jason Y Zhang、Panna Felsen、Angjoo Kanazawa 和 Jitendra Malik。从视频预测 3D 人体动态。2019年,ICCV。
  • [72] Weiyu Zhang, Menglong Zhu, and Konstantinos G Derpanis. From actemes to action: A strongly-supervised representation for detailed action understanding. In ICCV, 2013.
    Weiyu Zhang, Menglong Zhu, and Konstantinos G Derpanis.从动作到动作:用于详细动作理解的强监督表示法。ICCV, 2013.
  • [73] Hao Zhou, Wengang Zhou, Yun Zhou, and Houqiang Li. Spatial-temporal multi-cue network for sign language recognition and translation. IEEE Transactions on Multimedia, 2021.
    Hao Zhou, Wengang Zhou, Yun Zhou, and Houqiang Li.用于手语识别和翻译的时空多线索网络电气和电子工程师学会多媒体论文集,2021年。
  • [74] Yuxiao Zhou, Marc Habermann, Ikhsanul Habibie, Ayush Tewari, Christian Theobalt, and Feng Xu. Monocular real-time full body capture with inter-part correlations. In CVPR, 2021.
    Yuxiao Zhou, Marc Habermann, Ikhsanul Habibie, Ayush Tewari, Christian Theobalt, and Feng Xu.具有部件间相关性的单目实时全身捕捉。2021 年,CVPR。
  • [75] Xizhou Zhu, Weijie Su, Lewei Lu, Bin Li, Xiaogang Wang, and Jifeng Dai. Deformable DETR: deformable transformers for end-to-end object detection. In ICLR, 2021.
    Xizhou Zhu、Weijie Su、Lewei Lu、Bin Li、Xiaogang Wang 和 Jifeng Dai。可变形 DETR:用于端到端对象检测的可变形变换器。2021 年,ICLR。
  • [76] Christian Zimmermann, Duygu Ceylan, Jimei Yang, Bryan Russell, Max J. Argus, and Thomas Brox. FreiHAND: A dataset for markerless capture of hand pose and shape from single RGB images. In ICCV, 2019.
    Christian Zimmermann、Duygu Ceylan、Jimei Yang、Bryan Russell、Max J. Argus 和 Thomas Brox。 FreiHAND:从单张 RGB 图像无标记捕捉手部姿势和形状的数据集。In ICCV, 2019.

Supplementary Material:
One-Stage 3D Whole-Body Mesh Recovery with Component Aware Transformer

Overview

This supplementary material presents more details and additional results that are not included in the main paper due to the page limit. The included items are:

  • More experiment setup and details in Sec. A.
  • Efficiency comparison with SOTA in Sec. B.
  • Experiment on the AGORA dataset in Sec. C.
  • More introduction of UBody in Sec. D.
  • Inter-scene benchmark on the UBody dataset in Sec. E.
  • Qualitative comparisons with SOTA in Sec. F.

A Experiment Setup

Evaluation metrics. To quantitatively evaluate the performance of human mesh recovery, MPVPE, PA-MPVPE, MPJPE, and PA-MPJPE are used as evaluation metrics. In addition, on the AGORA test set, which contains many multi-person scenes, we report the normalized mean vertex error (NMVE) and normalized mean joint error (NMJE), which divide MVE and MPJPE by the standard detection metric F1 score (the harmonic mean of recall and precision) to penalize models for misses and false positives.
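As a concrete illustration, the sketch below shows one way these metrics could be computed, assuming predicted and ground-truth vertices are given as N×3 arrays in meters; `procrustes_align`, `pa_mpvpe`, and `nmve` are our own helper names for illustration, not part of any released evaluation code.

```python
import numpy as np

def procrustes_align(pred, gt):
    """Rigidly align pred to gt with a similarity transform (scale, rotation, translation)."""
    mu_p, mu_g = pred.mean(0), gt.mean(0)
    xp, xg = pred - mu_p, gt - mu_g
    var_p = (xp ** 2).sum()
    K = xp.T @ xg                                   # 3x3 cross-covariance
    U, _, Vt = np.linalg.svd(K)
    Z = np.eye(3)
    Z[-1, -1] = np.sign(np.linalg.det(Vt.T @ U.T))  # avoid reflections
    R = Vt.T @ Z @ U.T
    scale = np.trace(R @ K) / var_p
    t = mu_g - scale * (R @ mu_p)
    return scale * pred @ R.T + t

def pa_mpvpe(pred_verts, gt_verts):
    """Mean per-vertex position error (mm) after Procrustes alignment."""
    aligned = procrustes_align(pred_verts, gt_verts)
    return np.linalg.norm(aligned - gt_verts, axis=-1).mean() * 1000.0

def nmve(mve, f1):
    """Normalized mean vertex error: MVE divided by the detection F1 score (AGORA convention)."""
    return mve / f1
```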

Implementation details. Our OSX model is implemented in PyTorch. It is trained with the Adam optimizer ($\beta_1=0.1$, $\beta_2=0.999$) using a cosine annealing schedule for 14 epochs. The learning rate is initially set to $1\times10^{-4}$, and the batch size is 192. Random scaling, rotation, horizontal flipping, and color jittering are used as data augmentation during training. The spatial size of the input image is $256\times192$. The numbers of body tokens $\mathbf{T}_b$ and component tokens $\mathbf{T}_c$ are set to 27 and 92, respectively. For the experiments on the AGORA test set, we remove the decoder, as we find that it increases training time without significantly improving performance on this set. This observation may be attributed to the fact that the main challenge of AGORA is occlusion, while the decoder aims to estimate the hands and face at a finer level.
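For reference, a minimal PyTorch sketch of this optimization setup is given below; the model, losses, and dataloaders are placeholders, and only the optimizer and schedule hyper-parameters are taken from the text above.

```python
import torch
from torch.optim import Adam
from torch.optim.lr_scheduler import CosineAnnealingLR

def build_training(model, num_epochs=14, base_lr=1e-4):
    # Adam with the betas reported above and a cosine annealing schedule over all epochs.
    optimizer = Adam(model.parameters(), lr=base_lr, betas=(0.1, 0.999))
    scheduler = CosineAnnealingLR(optimizer, T_max=num_epochs)
    return optimizer, scheduler

# Usage sketch (batch size 192, inputs resized to 256x192 with random scale/rotation/flip/color jitter):
# optimizer, scheduler = build_training(model)
# for epoch in range(14):
#     train_one_epoch(model, train_loader, optimizer)   # placeholder training step
#     scheduler.step()
```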

B Efficiency comparison with SOTA methods

We report the complexity comparisons, including average inference time, number of model parameters, FLOPs, and NMJE-All on the AGORA test set, in Table S-1. The numbers are measured for single-person regression on the same input resolution using a machine with an NVIDIA A100 GPU. OSX has the shortest inference time and the lowest error, indicating its advantage in practical applications.
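The inference times in Table S-1 can be reproduced in spirit with a simple GPU timing loop such as the one below; the warm-up and iteration counts are assumptions, and `torch.cuda.synchronize` is needed so the host-side timer does not return before the CUDA kernels finish.

```python
import time
import torch

@torch.no_grad()
def measure_latency_ms(model, image, warmup=10, iters=100):
    """Average single-forward latency (ms) for one fixed-resolution input on one GPU."""
    model.eval().cuda()
    image = image.cuda()
    for _ in range(warmup):            # exclude CUDA context init / cudnn autotuning
        model(image)
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        model(image)
    torch.cuda.synchronize()
    return (time.perf_counter() - start) / iters * 1000.0
```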

Method | NMJE-All (mm) | Infer Time (ms) | Params (M) | FLOPs (G)
ExPose [48] | 263.3 | 120.2 | 135.8 | 28.5
PIXIE [16] | 230.9 | 192.0 | 192.9 | 34.3
H4W [39] | 141.1 | 73.3 | 77.9 | 16.7
PyMAF-X [70] | 140.0 | 209.3 | 205.9 | 35.5
OSX | 127.6 | 54.6 | 102.9 | 25.3
Table S-1: Efficiency comparisons with multi-stage methods.
Method | NMVE↓ Full-Body | NMVE↓ Body | NMJE↓ Full-Body | NMJE↓ Body | MVE↓ Full-Body | MVE↓ Body | MVE↓ Face | MVE↓ LH/RH | MPJPE↓ Full-Body | MPJPE↓ Body | MPJPE↓ Face | MPJPE↓ LH/RH
SMPLify-X [49] | 333.1 | 263.3 | 326.5 | 256.5 | 236.5 | 187.0 | 48.9 | 48.3/51.4 | 231.8 | 182.1 | 52.9 | 46.5/49.6
ExPose [48] | 265.0 | 184.8 | 263.3 | 183.4 | 217.3 | 151.5 | 51.1 | 74.9/71.3 | 215.9 | 150.4 | 55.2 | 72.5/68.8
FrankMocap [56] | - | 207.8 | - | 204.0 | - | 168.3 | - | 54.7/55.7 | - | 165.2 | - | 52.3/53.1
PIXIE [16] | 233.9 | 173.4 | 230.9 | 171.1 | 191.8 | 142.2 | 50.2 | 49.5/49.0 | 189.3 | 140.3 | 54.5 | 46.4/46.0
Hand4Whole [39] | 144.1 | 96.0 | 141.1 | 92.7 | 135.5 | 90.2 | 41.6 | 46.3/48.1 | 132.6 | 87.1 | 46.1 | 44.3/46.2
PyMAF-X [70] | 141.2 | 94.4 | 140.0 | 93.5 | 125.7 | 84.0 | 35.0 | 44.6/45.6 | 124.6 | 83.2 | 37.9 | 42.5/43.7
OSX (Ours) | 130.6 (↓7.5%) | 85.3 (↓9.6%) | 127.6 (↓8.9%) | 83.3 (↓10.9%) | 122.8 | 80.2 | 36.2 | 45.4/46.1 | 119.9 | 78.3 | 37.9 | 43.0/43.9
Table S-2: Reconstruction errors on the AGORA test set. Some methods are fine-tuned on the AGORA training set or similarly synthetic data [31].

C Experiment on AGORA Dataset

In this section, we report the complete results on the AGORA test set and the results on the AGORA val set.

AGORA Test Set. Table S-2 depicts the complete results on the AGORA test set. All the results are taken from the official leaderboard. As shown, our OSX outperforms the other competitors on most metrics, especially on body and full-body recovery. More specifically, for full-body reconstruction, OSX surpasses PyMAF-X [70] by 10.6 mm, 9.1 mm, 2.9 mm, and 4.7 mm on NMVE, NMJE, MVE, and MPJPE, respectively. Since PyMAF-X has a lower detected-person ratio, the two methods have similar results on the MVE and MPJPE metrics, which are computed only over matched persons. NMVE and NMJE additionally take misses and false positives into account, and under these metrics our method shows a larger improvement, i.e., overall better multi-person estimation. Notably, although OSX does not use extra hand-only and face-only datasets, it achieves competitive results on the hand and face metrics, which demonstrates the effectiveness of our component-aware decoder.

AGORA Val Set. Table S-3 shows the results on the AGORA val set. All the results except OSX are taken from [39]. Although we do not use extra hand/face-specific datasets during training, OSX outperforms the SOTA method Hand4Whole by 8.3% on MPVPE-All, demonstrating the effectiveness of our one-stage method.

Method | MPVPE↓ All | MPVPE↓ Hand | MPVPE↓ Face | PA-MPVPE↓ All | PA-MPVPE↓ Hand | PA-MPVPE↓ Face
ExPose [48] | 219.8 | 115.4 | 103.5 | 88.0 | 12.1 | 4.8
FrankMocap [56] | 218.0 | 95.2 | 105.4 | 90.6 | 11.2 | 4.9
PIXIE [16] | 203.0 | 89.9 | 95.4 | 82.7 | 12.8 | 5.4
Hand4Whole [39] | 183.9 | 72.8 | 81.6 | 73.2 | 9.7 | 4.7
OSX (Ours) | 168.6 (↓8.3%) | 70.6 | 77.2 | 69.4 | 11.5 | 4.8
Table S-3: Reconstruction errors on the AGORA val set.

D UBody: An Upper Body Dataset

D.1 Data Collection

To bridge the gap between the basic human mesh recovery task and its downstream applications, we design UBody with two rules. First, we survey a wide range of human-related downstream tasks with upper-body scenes, including gesture recognition [65, 19, 38], sign language recognition and translation [14, 54, 58, 27, 7, 73], person clustering [5], emotion analysis, speaker verification [44], micro-gesture understanding [36], audio-visual generation and separation [53], human action recognition and localization [50, 18, 55, 57, 17], and human video segmentation [33]. We select high-quality datasets from these existing tasks as part of our data for the corresponding scenarios. To ensure a balanced amount of data for each scene, for datasets with many videos (e.g., lasting 20k minutes), we manually select the videos in which the upper body appears more frequently.

Second, with all kinds of athletic competitions, entertainment shows, we-media, online conferences, and online classes becoming increasingly indispensable, we carefully select a large number of diverse videos from YouTube to provide new opportunities and challenges for potential applications.

Since some untrimmed videos contain segments in which the main subject is missing, extraneous frames such as opening and closing credits, and repetitive actions, we manually trim the long videos. Each edited clip is 10 seconds long, which ensures high video quality.

In order to prevent infringement of ownership rights, we only provide download links to the corresponding videos and our labels without any personal information.

In summary, we collect fifteen real-life scenarios with more than 1,051K frames. We split the train/test sets under two protocols as follows.

  • Intra-scene: in each scene, the first 70% of the videos form the training set, and the last 30% form the test set. This benchmark is provided in the main paper.
  • Inter-scene: we use ten scenes as the training set and the other five scenes as the test set. Due to the page limit, we present this benchmark in Table S-4. A code sketch of both split protocols is given after this list.
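The sketch below spells out the two protocols, assuming videos are grouped per scene; the helper names and the exact choice of held-out scenes are ours for illustration only.

```python
from typing import Dict, List, Tuple

def intra_scene_split(videos: Dict[str, List[str]]) -> Tuple[List[str], List[str]]:
    """Intra-scene: the first 70% of videos in every scene train, the last 30% test."""
    train, test = [], []
    for scene_videos in videos.values():
        cut = int(len(scene_videos) * 0.7)
        train += scene_videos[:cut]
        test += scene_videos[cut:]
    return train, test

def inter_scene_split(videos: Dict[str, List[str]],
                      test_scenes: List[str]) -> Tuple[List[str], List[str]]:
    """Inter-scene: ten scenes form the training set, the remaining five scenes the test set."""
    train, test = [], []
    for scene, scene_videos in videos.items():
        (test if scene in test_scenes else train).extend(scene_videos)
    return train, test
```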

D.2 Data Annotation Processes

As shown in Figure S-1, we design a thorough whole-body annotation pipeline with high precision. It is divided into two stages: 2D whole-body keypoint annotation and 3D SMPL-X annotation fitting. Since UBody scenes contain many unpredictable transitions and cutscenes that make temporal smoothing approaches [66, 52, 68] difficult to use, the annotation is conducted per frame.

Figure S-1: Illustration of the annotation pipeline of UBody. Black lines show the annotation process of 2D whole-body keypoints, and blue lines show the 3D SMPL-X annotation procedure. Red dotted lines indicate information updates.

2D whole-body keypoint annotation: We first detect all persons and their hands in an image via a dedicated human and hand detector, BodyHands [45], shown as the Body Detector and Hand Detector in Figure S-1. Leveraging the recent state-of-the-art 2D pose estimator ViT-Body-only [64], pre-trained on the COCO [35] dataset, we localize 17 body keypoints, denoted $K_{Body}$, for each detected person, which is highly robust across scenes. Due to the diverse scales and motion blur of fast-moving hands, we find that the Hand Detector outputs false positives or misses some hands. To enhance hand detection, we train a 2D whole-body estimator with 133 2D keypoints on COCO-WholeBody [24], called ViT-WholeBody, following the model design of ViTPose [64] and the masked-autoencoder pre-training scheme [20]. ViT-WholeBody provides high-recall hand keypoints $K_{Hand}$, but their localization precision is low because of the fully one-stage pipeline and the low resolution of hands in the raw image. Accordingly, we obtain coarse hand bounding boxes from the minimum and maximum coordinates of the detected left- and right-hand keypoints, and use them to correct the hand boxes from the Hand Detector via an IoU matching strategy (a sketch of this step is given below). We then use the refined hand boxes to crop the hand patches, resize them to a larger size, and feed them into a dedicated pre-trained ViT-Hand-only model trained with the hand labels from the COCO-WholeBody dataset. In summary, ViT-WholeBody outputs body, hand, and face 2D keypoints; we replace $K_{Body}$ with the body output from ViT-Body-only and $K_{Hand}$ with the fine hand keypoints from ViT-Hand-only. As the face of the current SMPL-X model does not require much detail, we simply use the 2D face keypoints $K_{Face}$ obtained from ViT-WholeBody.
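One plausible realization of this keypoint-guided hand-box correction is sketched below; the padding ratio, confidence threshold, and IoU threshold are assumptions, not values reported in the paper.

```python
import numpy as np

def box_from_hand_keypoints(kpts, conf, conf_thr=0.3, pad=0.2):
    """Coarse hand box from ViT-WholeBody hand keypoints: min/max over confident joints, padded."""
    valid = kpts[conf > conf_thr]
    if len(valid) == 0:
        return None
    (x0, y0), (x1, y1) = valid.min(0), valid.max(0)
    w, h = x1 - x0, y1 - y0
    return np.array([x0 - pad * w, y0 - pad * h, x1 + pad * w, y1 + pad * h])

def iou(a, b):
    x0, y0 = np.maximum(a[:2], b[:2])
    x1, y1 = np.minimum(a[2:], b[2:])
    inter = max(0.0, x1 - x0) * max(0.0, y1 - y0)
    area = lambda box: (box[2] - box[0]) * (box[3] - box[1])
    return inter / (area(a) + area(b) - inter + 1e-8)

def correct_hand_boxes(detector_boxes, keypoint_boxes, iou_thr=0.5):
    """Keep detector boxes matched by a keypoint-derived box; fall back to the keypoint box
    for hands the detector missed, and drop detector boxes with no keypoint support."""
    corrected = []
    for kb in keypoint_boxes:
        if kb is None:
            continue
        ious = [iou(kb, db) for db in detector_boxes]
        if ious and max(ious) > iou_thr:
            corrected.append(detector_boxes[int(np.argmax(ious))])  # detector box is better localized
        else:
            corrected.append(kb)                                    # recover a missed hand
    return corrected
```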

3D whole-body mesh recovery annotation: Different from previous optimization-based annotation [49], which may output implausible poses, we use our proposed OSX to estimate the SMPL-X parameters from human images as a proper 3D initialization that provides pseudo-3D constraints. Since 2D keypoint localization tends to be more accurate, we additionally supervise the projected 2D whole-body keypoints with the annotated 2D whole-body keypoints above when training OSX. More importantly, to avoid performance degradation from insufficiently accurate initial labels and to consistently improve the 3D annotation quality, we adopt an iterative training-labeling-revision loop that re-annotates the data every 30 epochs, for 120 epochs in total (see the sketch below).
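The loop can be summarized in pseudo-code as follows; `predict_smplx` and `train_one_epoch` are placeholders for the OSX inference and training steps, respectively, and are not part of a released API.

```python
def iterative_annotation(model, images, kpts_2d, total_epochs=120, cycle=30):
    """Training-labeling-revision loop: every 30 epochs, the current model re-annotates the
    pseudo 3D SMPL-X labels, so label quality and model quality improve together."""
    pseudo_smplx = predict_smplx(model, images)          # initial 3D pseudo-labels
    for start in range(0, total_epochs, cycle):
        for epoch in range(start, start + cycle):
            # supervise SMPL-X parameters with pseudo_smplx and projected 2D joints with kpts_2d
            train_one_epoch(model, images, pseudo_smplx, kpts_2d)
        pseudo_smplx = predict_smplx(model, images)      # revise the labels with the improved model
    return pseudo_smplx
```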

E Inter-Scene Benchmark on UBody dataset

Due to the page limit, we further provide another data protocol comparison to show the usage of the proposed UBody. Table S-4 presents the performance comparison of existing 3D whole-body methods. The inter-scene test shows larger errors than the intra-scene test due to the different motion and gesture distributions. The model fine-tuned on AGORA still has a significant gap compared to the one trained on the COCO dataset. Furthermore, we also train Hand4Whole and OSX on our training set and find a consistent improvement over the original pretrained models, indicating that UBody can serve to bridge the gap to these downstream real-life scenes. Moreover, different from the single-frame AGORA and EHF datasets, UBody provides videos, which can drive progress in spatial-temporal modeling on such edited media sources.

Method | MPVPE↓ All | MPVPE↓ Hand | MPVPE↓ Face | PA-MPVPE↓ All | PA-MPVPE↓ Hand | PA-MPVPE↓ Face
ExPose [48] | 185.7 | 89.5 | 47.2 | 76.4 | 11.8 | 4.0
PIXIE [16] | 185.0 | 60.9 | 45.3 | 74.5 | 11.9 | 4.2
Hand4Whole [39] (×) | 198.1 | 66.9 | 51.8 | 90.2 | 10.3 | 4.1
Hand4Whole [39] | 109.4 | 50.4 | 24.8 | 57.0 | 8.9 | 2.7
Hand4Whole [39] (†) | 87.4 | 41.6 | 22.1 | 46.3 | 8.0 | 2.0
OSX (Ours) | 100.7 | 52.5 | 24.5 | 52.9 | 9.5 | 2.6
OSX (Ours) (†) | 82.0 | 44.2 | 21.5 | 44.2 | 8.8 | 1.9
Table S-4: Reconstruction errors on the UBody test set under the inter-scene protocol. All models are pretrained on previous datasets, except for the results labeled by (i) †: fine-tuned on the UBody training data; (ii) ×: fine-tuned on the AGORA training data.

F Qualitative Comparisons with SOTA Methods

Qualitative comparisons on AGORA: We compare the mesh quality on the AGORA dataset in Figure S-2. AGORA is a synthetic dataset with many challenging factors such as heavy occlusion, dark environments, and unnatural multi-person interaction. It only contains a limited set of actions, e.g., holding phones, walking, and sitting. We can see that OSX consistently outperforms ExPose [48] and Hand4Whole [39] in terms of global body orientation, whole-body poses, and hand poses.

Figure S-2: Comparisons of existing 3D whole-body estimation methods on AGORA.

Qualitative comparisons on EHF: The visual comparisons of whole-body mesh recovery quality on the EHF dataset can be found in Figure S-3. As can be seen, OSX estimates the most accurate whole-body poses, in which body parts such as the hands and feet are better aligned with the person in the image.

Figure S-3: Comparisons of existing 3D whole-body estimation methods on EHF.

Qualitative comparisons on UBody: The qualitative comparison on our UBody is shown in Figure S-4. UBody focuses more on the expressive upper body. Hand4Whole [39] and our OSX produce better body mesh recoveries than ExPose [48]. Close inspection of the hand regions shows that our hand recovery is more accurate than that of Hand4Whole.

Figure S-4: Comparisons of existing 3D whole-body estimation methods on our proposed UBody.

Visualization of our annotation on UBody: Visualizations of our SMPL-X annotations on UBody can be found in Figures S-5, S-6, and S-7. Our annotation produces high-quality ground truth. In many challenging cases with expressive hand poses, the estimated meshes capture fine-level details.

Figure S-5: Illustration of the ground-truth SMPL-X annotation for eight scenes in UBody. For each scene, we show the input image (upper) and our annotation (lower).
Figure S-6: Illustration of the ground-truth SMPL-X annotation for seven other scenes in UBody. For each scene, we show the input image (upper) and our annotation (lower).
Figure S-7: Illustration of the ground-truth SMPL-X annotation for some special cases: multi-person scenes and full-body scenes in UBody. Our annotation pipeline still works well on these scenes.