
Perpetual Humanoid Control for Real-time Simulated Avatars

Zhengyi Luo1,2  Jinkun Cao2  Alexander Winkler1  Kris Kitani1,2  Weipeng Xu1

1Reality Labs Research, Meta; 2Carnegie Mellon University

https://zhengyiluo.github.io/PHC/
Abstract

We present a physics-based humanoid controller that achieves high-fidelity motion imitation and fault-tolerant behavior in the presence of noisy input (e.g. pose estimates from video or generated from language) and unexpected falls. Our controller scales up to learning ten thousand motion clips without using any external stabilizing forces and learns to naturally recover from fail-state. Given reference motion, our controller can perpetually control simulated avatars without requiring resets. At its core, we propose the progressive multiplicative control policy (PMCP), which dynamically allocates new network capacity to learn harder and harder motion sequences. PMCP allows efficient scaling for learning from large-scale motion databases and adding new tasks, such as fail-state recovery, without catastrophic forgetting. We demonstrate the effectiveness of our controller by using it to imitate noisy poses from video-based pose estimators and language-based motion generators in a live and real-time multi-person avatar use case.

Figure 1: We propose a motion imitator that can naturally recover from falls and walk to far-away reference motion, perpetually controlling simulated avatars without requiring reset. Left: real-time avatars from video, where the blue humanoid recovers from a fall. Right: Imitating 3 disjoint clips of motion generated from language, where our controller fills in the blank. The color gradient indicates the passage of time.


1 Introduction

Physics-based motion imitation has captured the imagination of vision and graphics communities due to its potential for creating realistic human motion, enabling plausible environmental interactions, and advancing virtual avatar technologies of the future. However, controlling high-degree-of-freedom (DOF) humanoids in simulation presents significant challenges, as they can fall, trip, or deviate from their reference motions, and struggle to recover. For example, controlling simulated humanoids using poses estimated from noisy video observations can often lead humanoids to fall to the ground [50, 51, 22, 24]. These limitations prevent the widespread adoption of physics-based methods, as current control policies cannot handle noisy observations such as video or language.

In order to apply physically simulated humanoids to avatars, the first major challenge is learning a motion imitator (controller) that can faithfully reproduce human-like motion with a high success rate. While reinforcement learning (RL)-based imitation policies have shown promising results, successfully imitating motion from a large dataset, such as AMASS (ten thousand clips, 40 hours of motion), with a single policy has yet to be achieved. Attempts to use larger networks or a mixture of expert policies have met with some success [45, 47], although they have not yet scaled to the largest datasets. Therefore, researchers have resorted to using external forces to help stabilize the humanoid. Residual force control (RFC) [52] has helped to create motion imitators that can mimic up to 97% of the AMASS dataset [22], and has seen successful applications in human pose estimation from video [54, 23, 12] and language-based motion generation [53]. However, the external force compromises physical realism by acting as a “hand of God” that puppets the humanoid, leading to artifacts such as flying and floating. One might argue that, with RFC, the realism of the simulation is compromised, as the model can freely apply a non-physical force on the humanoid.

Another important aspect of controlling simulated humanoids is how to handle noisy input and failure cases. In this work, we consider human poses estimated from video or language input. Especially with video input, artifacts such as floating [53], foot sliding [57], and physically impossible poses are prevalent in popular pose estimation methods due to occlusion, challenging viewpoints and lighting, fast motion, etc. To handle these cases, most physics-based methods resort to resetting the humanoid when a failure condition is triggered [24, 22, 51]. However, resetting successfully requires a high-quality reference pose, which is often difficult to obtain due to the noisy nature of the pose estimates, leading to a vicious cycle of falling and resetting to unreliable poses. Thus, it is important to have a controller that can gracefully handle unexpected falls and noisy input, naturally recover from fail-state, and resume imitation.

In this work, our aim is to create a humanoid controller specifically designed to control real-time virtual avatars, where video observations of a human user are used to control the avatar. We design the Perpetual Humanoid Controller (PHC), a single policy that achieves a high success rate on motion imitation and can naturally recover from fail-state. We propose a progressive multiplicative control policy (PMCP) to learn from motion sequences in the entire AMASS dataset without suffering catastrophic forgetting. By treating harder and harder motion sequences as different “tasks” and gradually allocating new network capacity to learn them, PMCP retains its ability to imitate easier motion clips when learning harder ones. PMCP also allows the controller to learn fail-state recovery tasks without compromising its motion imitation capabilities. Additionally, we adopt the Adversarial Motion Prior (AMP) [35] throughout our pipeline to ensure natural and human-like behavior during fail-state recovery. Furthermore, while most motion imitation methods require estimates of both link position and rotation as input, we show that we can design controllers that require only the link positions. This input can be generated more easily by vision-based 3D keypoint estimators or 3D pose estimates from VR controllers.

To summarize, our contributions are as follows: (1) we propose a Perpetual Humanoid Controller that can successfully imitate 98.9% of the AMASS dataset without applying any external forces; (2) we propose the progressive multiplicative control policy to learn from a large motion dataset without catastrophic forgetting and to unlock additional capabilities such as fail-state recovery; (3) our controller is task-agnostic and compatible with off-the-shelf video-based pose estimators as a drop-in solution. We demonstrate the capabilities of our controller by evaluating it on both Motion Capture (MoCap) data and motion estimated from videos. We also show a live (30 fps) demo of driving perpetually simulated avatars using webcam video as input.

2 Related Works

Physics-based Motion Imitation

Governed by the laws of physics, simulated characters [32, 31, 33, 35, 34, 7, 45, 52, 28, 13, 2, 11, 46, 12] have the distinct advantage of creating natural human motion, human-to-human interaction [20, 48], and human-object interactions [28, 34]. Since most modern physics simulators are not differentiable, training these simulated agents requires RL, which is time-consuming and costly. As a result, most of the work focuses on small-scale use cases such as interactive control based on user input [45, 2, 35, 34], playing sports [48, 20, 28], or other modular tasks (reaching goals [49], dribbling [35], moving around [32], etc.). On the other hand, imitating large-scale motion datasets is a challenging yet fundamental task, as an agent that can imitate reference motion can be easily paired with a motion generator to achieve different tasks. From learning to imitate a single clip [31] to entire datasets [47, 45, 7, 44], motion imitators have demonstrated their impressive ability to imitate reference motion, but are often limited to imitating high-quality MoCap data. Among them, ScaDiver [47] uses a mixture-of-experts policy to scale up to the CMU MoCap dataset and achieves a success rate of around 80% measured by time to failure. Unicon [45] shows qualitative results in imitation and transfer, but does not quantify the imitator’s ability to imitate clips from datasets. MoCapAct [44] first learns single-clip experts on the CMU MoCap dataset, and distills them into a single policy that achieves around 80% of the experts’ performance. The effort closest to ours is UHC [22], which successfully imitates 97% of the AMASS dataset. However, UHC uses residual force control [51], which applies a non-physical force at the root of the humanoid to help balance. Although effective in preventing the humanoid from falling, RFC reduces physical realism and creates artifacts such as floating and swinging, especially when motion sequences become challenging [22, 23]. Compared to UHC, our controller does not utilize any external force.

Fail-state Recovery for Simulated Characters

As simulated characters can easily fall when losing balance, many approaches [39, 51, 34, 42, 7] have been proposed to help recovery. PhysCap [39] uses a floating-base humanoid that does not require balancing. This compromises physical realism, as the humanoid is no longer properly simulated. Egopose [51] designs a fail-safe mechanism that resets the humanoid to the kinematic pose when it is about to fall, leading to potential teleportation behavior in which the humanoid keeps resetting to unreliable kinematic poses. NeuroMoCon [14] utilizes sampling-based control and reruns the sampling process if the humanoid falls. Although effective, this approach does not guarantee success and precludes real-time use cases. Another natural approach is to use an additional recovery policy [7] when the humanoid has deviated from the reference motion. However, since such a recovery policy no longer has access to the reference motion, it produces unnatural behavior, such as high-frequency jitter. To combat this, ASE [34] demonstrates the ability to rise naturally from the ground for a sword-swinging policy. While impressive, in motion imitation the policy not only needs to get up from the ground, but also to go back to tracking the reference motion. In this work, we propose a comprehensive solution to the fail-state recovery problem in motion imitation: our PHC can rise from a fallen state, naturally walk back to the reference motion, and resume imitation.

Progressive Reinforcement Learning

When learning from data containing diverse patterns, catastrophic forgetting [9, 27] is observed when attempting to perform multi-task or transfer learning by fine-tuning. Various approaches [8, 16, 18] have been proposed to combat this phenomenon, such as regularizing the weights of the network [18], learning multiple experts [16], or increasing the capacity using a mixture of experts [56, 38, 47] or multiplicative control [33]. A related paradigm has been studied in transfer learning and domain adaptation as progressive learning [6, 4] or curriculum learning [1]. Recently, progressive reinforcement learning [3] has been proposed to distill skills from multiple expert policies. It aims to find a policy that best matches the action distribution of the experts instead of finding an optimal mix of experts. Progressive Neural Networks (PNN) [36] avoid catastrophic forgetting by freezing the weights of previously learned subnetworks and initializing additional subnetworks to learn new tasks. The experiences from previous subnetworks are forwarded through lateral connections. PNN requires manually choosing which subnetwork to use based on the task, preventing it from being used in motion imitation, since reference motion does not have the concept of task labels.

3 Method

We define the reference pose as $\bm{\hat{q}}_{t}\triangleq(\bm{\hat{\theta}}_{t},\bm{\hat{p}}_{t})$, consisting of the 3D joint rotation $\bm{\hat{\theta}}_{t}\in\mathbb{R}^{J\times 6}$ and position $\bm{\hat{p}}_{t}\in\mathbb{R}^{J\times 3}$ of all $J$ links on the humanoid (we use the 6 DoF rotation representation [55]). From reference poses $\bm{\hat{q}}_{1:T}$, one can compute the reference velocities $\bm{\hat{\dot{q}}}_{1:T}$ through finite differences, where $\bm{\hat{\dot{q}}}_{t}\triangleq(\bm{\hat{\omega}}_{t},\bm{\hat{v}}_{t})$ consists of angular velocities $\bm{\hat{\omega}}_{t}\in\mathbb{R}^{J\times 3}$ and linear velocities $\bm{\hat{v}}_{t}\in\mathbb{R}^{J\times 3}$. We differentiate rotation-based and keypoint-based motion imitation by input: rotation-based imitation relies on reference poses $\bm{\hat{q}}_{1:T}$ (both rotation and keypoints), while keypoint-based imitation only requires 3D keypoints $\bm{\hat{p}}_{1:T}$. As a notation convention, we use $\widetilde{\cdot}$ to represent kinematic quantities (without physics simulation) from pose estimators/keypoint detectors, $\widehat{\cdot}$ to denote ground-truth quantities from Motion Capture (MoCap), and symbols without accents for values from the physics simulation. We use “imitate”, “track”, and “mimic” reference motion interchangeably. In Sec. 3.1, we first set up the preliminaries of our main framework. Sec. 3.2 describes our progressive multiplicative control policy for learning to imitate a large dataset of human motion and for recovering from fail-state. Finally, in Sec. 3.3, we briefly describe how we connect our task-agnostic controller to off-the-shelf video pose estimators and motion generators for real-time use cases.

3.1 Goal Conditioned Motion Imitation with Adversarial Motion Prior

Our controller follows the general framework of goal-conditioned RL (Fig. 3), where a goal-conditioned policy $\pi_{\text{PHC}}$ is tasked to imitate reference motion $\bm{\hat{q}}_{1:T}$ or keypoints $\bm{\hat{p}}_{1:T}$. Similar to prior work [22, 31], we formulate the task as a Markov Decision Process (MDP) defined by the tuple $\mathcal{M}=\langle\mathcal{S},\mathcal{A},\mathcal{T},\mathcal{R},\gamma\rangle$ of states, actions, transition dynamics, reward function, and discount factor. The physics simulation determines the state $\bm{s}_{t}\in\mathcal{S}$ and transition dynamics $\mathcal{T}$, while our policy $\pi_{\text{PHC}}$ computes the per-step action $\bm{a}_{t}\in\mathcal{A}$. Based on the simulation state $\bm{s}_{t}$ and the reference motion $\bm{\hat{q}}_{t}$, the reward function $\mathcal{R}$ computes a reward $r_{t}=\mathcal{R}(\bm{s}_{t},\bm{\hat{q}}_{t})$ as the learning signal for our policy. The policy's goal is to maximize the discounted return $\mathbb{E}\left[\sum_{t=1}^{T}\gamma^{t-1}r_{t}\right]$, and we use proximal policy optimization (PPO) [37] to learn $\pi_{\text{PHC}}$.

Figure 2: Our progressive training procedure trains primitives $\bm{\mathcal{P}}^{(1)},\bm{\mathcal{P}}^{(2)},\cdots,\bm{\mathcal{P}}^{(K)}$ by gradually learning harder and harder sequences. The fail-state recovery primitive $\bm{\mathcal{P}}^{(F)}$ is trained last, on simple locomotion data; a composer is then trained to combine these frozen primitives.
Figure 3: Goal-conditioned RL framework with Adversarial Motion Prior. Each primitive $\bm{\mathcal{P}}^{(k)}$ and the composer $\bm{\mathcal{C}}$ are trained using the same procedure; here we visualize the final product $\pi_{\text{PHC}}$.

State

The simulation state $\bm{s}_{t}\triangleq(\bm{s}^{\text{p}}_{t},\bm{s}^{\text{g}}_{t})$ consists of the humanoid proprioception $\bm{s}^{\text{p}}_{t}$ and the goal state $\bm{s}^{\text{g}}_{t}$. Proprioception $\bm{s}^{\text{p}}_{t}\triangleq(\bm{q}_{t},\bm{\dot{q}}_{t},\bm{\beta})$ contains the 3D body pose $\bm{q}_{t}$, velocity $\bm{\dot{q}}_{t}$, and (optionally) body shape $\bm{\beta}$. When trained with different body shapes, $\bm{\beta}$ contains information about the limb length of each body link [24]. For rotation-based motion imitation, the goal state $\bm{s}^{\text{g}}_{t}$ is defined as the difference between the next-time-step reference quantities and their simulated counterparts:

$\bm{s}^{\text{g-rot}}_{t}\triangleq(\bm{\hat{\theta}}_{t+1}\ominus\bm{\theta}_{t},\;\bm{\hat{p}}_{t+1}-\bm{p}_{t},\;\bm{\hat{v}}_{t+1}-\bm{v}_{t},\;\bm{\hat{\omega}}_{t}-\bm{\omega}_{t},\;\bm{\hat{\theta}}_{t+1},\;\bm{\hat{p}}_{t+1})$

where $\ominus$ calculates the rotation difference. For keypoint-only imitation, the goal state becomes

$\bm{s}^{\text{g-kp}}_{t}\triangleq(\bm{\hat{p}}_{t+1}-\bm{p}_{t},\;\bm{\hat{v}}_{t+1}-\bm{v}_{t},\;\bm{\hat{p}}_{t+1}).$

All of the above quantities in $\bm{s}^{\text{g}}_{t}$ and $\bm{s}^{\text{p}}_{t}$ are normalized with respect to the humanoid's current facing direction and root position [49, 22].
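To make the keypoint goal state concrete, the sketch below computes $\bm{s}^{\text{g-kp}}_{t}$ from simulated and reference keypoints and expresses it in the heading-and-root-normalized frame described above. It is a minimal NumPy illustration only: the quaternion convention, the yaw-removal helper, and the array shapes are assumptions rather than the authors' implementation.

    import numpy as np

    def yaw_inverse_rotation(root_quat):
        """Build a rotation matrix that removes the humanoid's heading (yaw).
        Assumes root_quat is (w, x, y, z) with z as the up axis."""
        w, x, y, z = root_quat
        yaw = np.arctan2(2.0 * (w * z + x * y), 1.0 - 2.0 * (y * y + z * z))
        c, s = np.cos(-yaw), np.sin(-yaw)
        return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

    def keypoint_goal_state(p_sim, v_sim, p_ref_next, v_ref_next, root_pos, root_quat):
        """s^{g-kp}_t = (p_hat_{t+1} - p_t, v_hat_{t+1} - v_t, p_hat_{t+1}),
        expressed in the heading- and root-centered frame.  Keypoint arrays are (J, 3)."""
        R = yaw_inverse_rotation(root_quat)           # (3, 3) heading-removal rotation
        dp = (p_ref_next - p_sim) @ R.T               # position difference, local frame
        dv = (v_ref_next - v_sim) @ R.T               # velocity difference, local frame
        p_ref_local = (p_ref_next - root_pos) @ R.T   # reference keypoints w.r.t. the root
        return np.concatenate([dp, dv, p_ref_local], axis=-1).reshape(-1)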

Reward

Unlike prior motion tracking policies that only use a motion imitation reward, we use the recently proposed Adversarial Motion Prior [35] and include a discriminator reward term throughout our framework. Including the discriminator term helps our controller produce stable and natural motion and is especially crucial for learning natural fail-state recovery behaviors. Specifically, our reward is defined as the sum of a task reward $r^{\text{g}}_{t}$, a style reward $r^{\text{amp}}_{t}$, and an additional energy penalty $r^{\text{energy}}_{t}$ [31]:

$r_{t}=0.5\,r^{\text{g}}_{t}+0.5\,r^{\text{amp}}_{t}+r^{\text{energy}}_{t}.$ (1)

For the discriminator, we use the same observations, loss formulation, and gradient penalty as AMP [35]. The energy penalty is expressed as $-0.0005\cdot\sum_{j\in\text{joints}}\left|\bm{\mu}_{j}\bm{\omega}_{j}\right|^{2}$, where $\bm{\mu}_{j}$ and $\bm{\omega}_{j}$ correspond to the joint torque and joint angular velocity, respectively. The energy penalty [10] regulates the policy and prevents the high-frequency foot jitter that can manifest in a policy trained without external force (see Sec. 4.1). The task reward is defined based on the current training objective, chosen by switching between the reward function for motion imitation, $\bm{\mathcal{R}}^{\text{imitation}}$, and that for fail-state recovery, $\bm{\mathcal{R}}^{\text{recover}}$. For motion tracking, we use:

$r^{\text{g-imitation}}_{t}=\bm{\mathcal{R}}^{\text{imitation}}(\bm{s}_{t},\bm{\hat{q}}_{t})=w_{\text{jp}}e^{-100\|\bm{\hat{p}}_{t}-\bm{p}_{t}\|}+w_{\text{jr}}e^{-10\|\bm{\hat{q}}_{t}\ominus\bm{q}_{t}\|}+w_{\text{jv}}e^{-0.1\|\bm{\hat{v}}_{t}-\bm{v}_{t}\|}+w_{\text{j}\omega}e^{-0.1\|\bm{\hat{\omega}}_{t}-\bm{\omega}_{t}\|}$ (2)

where we measure the difference between the translation, rotation, linear velocity, and angular velocity of the rigid bodies for all links in the humanoid. For fail-state recovery, we define the reward $r^{\text{g-recover}}_{t}$ in Eq. 4.
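As a concrete reading of Eq. 1 and Eq. 2, the sketch below evaluates the imitation task reward and combines it with the style reward and energy penalty. The weight values and the scalar error aggregation are placeholders; only the coefficients inside the exponentials, the energy scale, and the 0.5/0.5 mixing follow the text.

    import numpy as np

    def imitation_reward(p_ref, p_sim, rot_err, v_ref, v_sim, w_ref, w_sim,
                         w_jp=0.5, w_jr=0.3, w_jv=0.1, w_jw=0.1):
        """Eq. 2: exponentiated tracking errors over all links.
        rot_err is the per-link rotation difference ||q_hat_t (-) q_t||."""
        e_p = np.linalg.norm(p_ref - p_sim)   # translation error
        e_v = np.linalg.norm(v_ref - v_sim)   # linear-velocity error
        e_w = np.linalg.norm(w_ref - w_sim)   # angular-velocity error
        return (w_jp * np.exp(-100.0 * e_p) + w_jr * np.exp(-10.0 * rot_err)
                + w_jv * np.exp(-0.1 * e_v) + w_jw * np.exp(-0.1 * e_w))

    def energy_penalty(torque, joint_ang_vel):
        """-0.0005 * sum_j |tau_j * omega_j|^2 over all actuated joints."""
        return -0.0005 * np.sum(np.abs(torque * joint_ang_vel) ** 2)

    def total_reward(r_task, r_amp, r_energy):
        """Eq. 1: equally weighted task and style rewards plus the energy term."""
        return 0.5 * r_task + 0.5 * r_amp + r_energy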

Action

We use a proportional derivative (PD) controller at each DoF of the humanoid, and the action $\bm{a}_{t}$ specifies the PD target. With the target joint set as $\bm{q}^{d}_{t}=\bm{a}_{t}$, the torque applied at each joint is $\bm{\tau}^{i}=\bm{k}^{p}\circ(\bm{a}_{t}-\bm{q}_{t})-\bm{k}^{d}\circ\bm{\dot{q}}_{t}$. Notice that this is different from the residual action representation [52, 22, 30] used in prior motion imitation methods, where the action is added to the reference pose ($\bm{q}^{d}_{t}=\bm{\hat{q}}_{t}+\bm{a}_{t}$) to speed up training. As our PHC needs to remain robust to noisy and ill-posed reference motion, we remove such a dependency on the reference motion from our action space. We do not use any external forces [52] or meta-PD control [54].
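A minimal sketch of this non-residual PD action mapping, assuming per-joint gain vectors kp and kd; because the PD target is the raw action rather than reference-plus-residual, a corrupted reference pose cannot directly drag the targets off balance.

    import numpy as np

    def pd_torque(action, joint_pos, joint_vel, kp, kd):
        """tau = kp * (a_t - q_t) - kd * qdot_t, applied element-wise per DoF.
        The PD target is the action itself (no residual on the reference pose)."""
        return kp * (action - joint_pos) - kd * joint_vel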

Control Policy and Discriminator

Our control policy $\pi_{\text{PHC}}(\bm{a}_{t}|\bm{s}_{t})=\mathcal{N}(\mu(\bm{s}_{t}),\sigma)$ represents a Gaussian distribution with fixed diagonal covariance. The AMP discriminator $\bm{\mathcal{D}}(\bm{s}^{\text{p}}_{t-10:t})$ computes a real/fake score based on the humanoid's recent proprioception. All of our networks (discriminator, primitives, value function, and composer) are two-layer multilayer perceptrons (MLPs) with dimensions [1024, 512].
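For reference, a minimal PyTorch sketch of a fixed-variance Gaussian policy with the [1024, 512] MLP backbone mentioned above; the observation/action dimensions, the activation choice, and the initial log-std value are assumptions, not the authors' settings.

    import torch
    import torch.nn as nn

    class GaussianMLPPolicy(nn.Module):
        """Two-layer MLP mean network with a state-independent diagonal std."""
        def __init__(self, obs_dim, act_dim, init_log_std=-2.9):
            super().__init__()
            self.mu = nn.Sequential(
                nn.Linear(obs_dim, 1024), nn.SiLU(),
                nn.Linear(1024, 512), nn.SiLU(),
                nn.Linear(512, act_dim),
            )
            # Fixed diagonal covariance: the log-std is a constant, not learned per state.
            self.log_std = nn.Parameter(torch.full((act_dim,), init_log_std),
                                        requires_grad=False)

        def forward(self, obs):
            mean = self.mu(obs)
            return torch.distributions.Normal(mean, self.log_std.exp())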

Humanoid

Our humanoid controller can support any human kinematic structure, and we use the SMPL [21] kinematic structure following prior art [54, 22, 23]. The SMPL body contains 24 rigid bodies, of which 23 are actuated, resulting in an action space of $\bm{a}_{t}\in\mathbb{R}^{23\times 3}$. The body proportions can vary based on a body shape parameter $\beta\in\mathbb{R}^{10}$.

Initialization and Relaxed Early Termination

We use reference state initialization (RSI) [31] during training and randomly select a starting point within a motion clip for imitation. For early termination, we follow UHC [22] and terminate the episode when the joints are, on average, more than 0.5 meters away from the reference motion in global space. Unlike UHC, we remove the ankle and toe joints from the termination condition. As observed by RFC [52], there exists a dynamics mismatch between simulated humanoids and real humans, especially since the real human foot is multi-segment [29]. Thus, it is not possible for the simulated humanoid to have exactly the same foot movement as MoCap, and blindly following the reference foot movement may lead to the humanoid losing balance. We therefore propose Relaxed Early Termination (RET), which allows the humanoid's ankles and toes to deviate slightly from the MoCap motion to remain balanced. Notice that the humanoid still receives imitation and discriminator rewards for these body parts, which prevents these joints from moving in a non-human manner. We show that, though a small detail, RET is conducive to achieving a high motion imitation success rate.
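A sketch of the relaxed early-termination check, assuming per-link global positions of shape (J, 3) and a hypothetical list of ankle/toe link indices to exclude:

    import numpy as np

    # Hypothetical indices of the ankle and toe links in the SMPL body ordering.
    FOOT_LINKS = [7, 8, 10, 11]

    def should_terminate(p_sim, p_ref, threshold=0.5):
        """Terminate when the mean global link error exceeds `threshold` meters,
        ignoring ankles and toes (Relaxed Early Termination)."""
        keep = [j for j in range(p_sim.shape[0]) if j not in FOOT_LINKS]
        err = np.linalg.norm(p_sim[keep] - p_ref[keep], axis=-1)  # per-link error
        return err.mean() > threshold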

Hard Negative Mining

When learning from a large motion dataset, it is essential to train on harder sequences in the later stages of training to gather more informative experiences. We use a hard negative mining procedure similar to UHC [22] and define hard sequences by whether or not our controller can successfully imitate them. From a motion dataset $\bm{\hat{Q}}$, we find hard sequences $\bm{\hat{Q}}_{\text{hard}}\subseteq\bm{\hat{Q}}$ by evaluating our model over the entire dataset and selecting the sequences that our policy fails to imitate.
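The mining step amounts to a full-dataset evaluation pass; a sketch is given below, where imitate() is a hypothetical stand-in for rolling out the current policy on one clip and reporting success.

    def mine_hard_sequences(policy, dataset, imitate):
        """Return the subset of clips the current policy fails to track.
        `imitate(policy, clip)` rolls out the policy on the clip and returns
        True on success (i.e., no early termination)."""
        return [clip for clip in dataset if not imitate(policy, clip)]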

3.2 Progressive Multiplicative Control Policy

As training continues, we notice that the performance of the model plateaus as it forgets older sequences when learning new ones. Hard negative mining alleviates the problem to a certain extent, yet suffers from the same issue. Introducing new tasks, such as fail-state recovery, may further degrade imitation performance due to catastrophic forgetting. These effects are more concretely categorized in the Appendix (App. C). Thus, we propose the progressive multiplicative control policy (PMCP), which allocates new subnetworks (primitives $\bm{\mathcal{P}}$) to learn harder sequences.

Progressive Neural Networks (PNN)

A PNN [36] starts with a single primitive network $\bm{\mathcal{P}}^{(1)}$ trained on the full dataset $\bm{\hat{Q}}$. Once $\bm{\mathcal{P}}^{(1)}$ is trained to convergence on the entire motion dataset $\bm{\hat{Q}}$ using the imitation task, we create a subset of hard motions by evaluating $\bm{\mathcal{P}}^{(1)}$ on $\bm{\hat{Q}}$. We define convergence as the point at which the success rate on $\bm{\hat{Q}}_{\text{hard}}^{(k)}$ no longer increases. The sequences that $\bm{\mathcal{P}}^{(1)}$ fails on form $\bm{\hat{Q}}_{\text{hard}}^{(1)}$. We then freeze the parameters of $\bm{\mathcal{P}}^{(1)}$ and create a new primitive $\bm{\mathcal{P}}^{(2)}$ (randomly initialized), along with lateral connections that connect each layer of $\bm{\mathcal{P}}^{(1)}$ to $\bm{\mathcal{P}}^{(2)}$. For more information about PNN, please refer to our supplementary material. During training, we construct each $\bm{\hat{Q}}_{\text{hard}}^{(k)}$ by selecting the failed sequences from the previous step $\bm{\hat{Q}}_{\text{hard}}^{(k-1)}$, resulting in smaller and smaller hard subsets: $\bm{\hat{Q}}_{\text{hard}}^{(k)}\subseteq\bm{\hat{Q}}_{\text{hard}}^{(k-1)}$. In this way, we ensure that each newly initialized primitive $\bm{\mathcal{P}}^{(k)}$ is responsible for learning a new and harder subset of motion sequences, as shown in Fig. 2. Notice that this is different from hard-negative mining in UHC [22], as we initialize a new primitive $\bm{\mathcal{P}}^{(k+1)}$ to train. Since the original PNN was proposed to solve completely new tasks (such as different Atari games), its lateral connection mechanism allows later tasks to choose between reusing, modifying, or discarding prior experiences. However, mimicking human motion is highly correlated across sequences, so fitting to harder sequences $\bm{\hat{Q}}_{\text{hard}}^{(k)}$ can effectively draw on previous motor control experience. Thus, we also consider a variant of PNN with no lateral connections, where each new primitive is instead initialized from the weights of the previous one. This weight-sharing scheme is similar to fine-tuning on the harder motion sequences using a new primitive $\bm{\mathcal{P}}^{(k+1)}$ while preserving $\bm{\mathcal{P}}^{(k)}$'s ability to imitate the already-learned sequences.
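A minimal sketch of the weight-sharing variant described above: the current primitive is frozen, and the next primitive is initialized from its weights and trained on the remaining hard subset. The lateral connections of the original PNN are omitted, and the class and method names are illustrative only.

    import copy
    import torch.nn as nn

    class PrimitiveStack(nn.Module):
        """Keeps frozen primitives P^(1..k-1) and one trainable primitive P^(k)."""
        def __init__(self, make_primitive):
            super().__init__()
            self.frozen = nn.ModuleList()
            self.active = make_primitive()          # P^(1), randomly initialized

        def grow(self):
            """Freeze the current primitive and spawn P^(k+1) from its weights."""
            for p in self.active.parameters():
                p.requires_grad_(False)
            self.frozen.append(self.active)
            self.active = copy.deepcopy(self.frozen[-1])  # weight-sharing initialization
            for p in self.active.parameters():
                p.requires_grad_(True)

        def primitives(self):
            return list(self.frozen) + [self.active]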

Function TrainPPO($\pi$, $\bm{\hat{Q}}^{(k)}$, $\bm{\mathcal{D}}$, $\bm{\mathcal{V}}$, $\bm{\mathcal{R}}$):
    while not converged do
        $\bm{M}\leftarrow\emptyset$  // initialize sampling memory
        while $\bm{M}$ not full do
            $\bm{\hat{q}}_{1:T}\leftarrow$ sample motion from $\bm{\hat{Q}}^{(k)}$
            for $t\leftarrow 1\ldots T$ do
                $\bm{s}_{t}\leftarrow(\bm{s}^{\text{p}}_{t},\bm{s}^{\text{g}}_{t})$
                $\bm{a}_{t}\leftarrow\pi(\bm{a}_{t}|\bm{s}_{t})$
                $\bm{s}_{t+1}\leftarrow\mathcal{T}(\bm{s}_{t+1}|\bm{s}_{t},\bm{a}_{t})$  // simulation
                $r_{t}\leftarrow\bm{\mathcal{R}}(\bm{s}_{t},\bm{\hat{q}}_{t+1})$
                store $(\bm{s}_{t},\bm{a}_{t},r_{t},\bm{s}_{t+1})$ into memory $\bm{M}$
        $\pi,\bm{\mathcal{V}}\leftarrow$ PPO update using experiences collected in $\bm{M}$
        $\bm{\mathcal{D}}\leftarrow$ discriminator update using experiences collected in $\bm{M}$
    return $\pi$

Input: ground-truth motion dataset $\bm{\hat{Q}}$
Initialize $\bm{\mathcal{D}}$, $\bm{\mathcal{V}}$; $\bm{\hat{Q}}_{\text{hard}}^{(1)}\leftarrow\bm{\hat{Q}}$  // discriminator, value function, and dataset
for $k\leftarrow 1\ldots K$ do
    Initialize $\bm{\mathcal{P}}^{(k)}$  // lateral connection / weight sharing
    $\bm{\mathcal{P}}^{(k)}\leftarrow$ TrainPPO($\bm{\mathcal{P}}^{(k)}$, $\bm{\hat{Q}}_{\text{hard}}^{(k)}$, $\bm{\mathcal{D}}$, $\bm{\mathcal{V}}$, $\bm{\mathcal{R}}^{\text{imitation}}$)
    $\bm{\hat{Q}}_{\text{hard}}^{(k+1)}\leftarrow$ eval($\bm{\mathcal{P}}^{(k)}$, $\bm{\hat{Q}}_{\text{hard}}^{(k)}$)
    $\bm{\mathcal{P}}^{(k)}\leftarrow$ freeze $\bm{\mathcal{P}}^{(k)}$
$\bm{\mathcal{P}}^{(F)}\leftarrow$ TrainPPO($\bm{\mathcal{P}}^{(F)}$, $\bm{Q}^{\text{loco}}$, $\bm{\mathcal{D}}$, $\bm{\mathcal{V}}$, $\bm{\mathcal{R}}^{\text{recover}}$)  // fail-state recovery
$\pi_{\text{PHC}}\leftarrow\{\bm{\mathcal{P}}^{(1)}\cdots\bm{\mathcal{P}}^{(K)},\bm{\mathcal{P}}^{(F)},\bm{\mathcal{C}}\}$
$\pi_{\text{PHC}}\leftarrow$ TrainPPO($\pi_{\text{PHC}}$, $\bm{\hat{Q}}$, $\bm{\mathcal{D}}$, $\bm{\mathcal{V}}$, $\{\bm{\mathcal{R}}^{\text{imitation}},\bm{\mathcal{R}}^{\text{recover}}\}$)  // train composer

Algorithm 1: Learn Progressive Multiplicative Control Policy
Figure 4: (a) Imitating high-quality MoCap – spin and kick. (b) Recovering from a fallen state and going back to the reference motion (indicated by red dots). (c) Imitating noisy motion estimated from video. (d) Imitating motion generated from language. (e) Using poses estimated from a webcam stream for a real-time simulated avatar.

Fail-state Recovery

In addition to learning harder sequences, we also learn new tasks, such as recovering from fail-state. We define three types of fail-state: 1) fallen on the ground; 2) far away from the reference motion ($>0.5$ m); 3) their combination: fallen and far away. In these situations, the humanoid should get up from the ground, approach the reference motion in a natural way, and resume motion imitation. For this new task, we initialize a primitive $\bm{\mathcal{P}}^{(F)}$ at the end of the primitive stack. $\bm{\mathcal{P}}^{(F)}$ shares the same input and output space as $\bm{\mathcal{P}}^{(1)}\cdots\bm{\mathcal{P}}^{(K)}$, but since the reference motion does not provide useful information about fail-state recovery (the humanoid should not attempt to imitate the reference motion when lying on the ground), we modify the state space during fail-state recovery to remove all information about the reference motion except the root. For the reference joint rotations $\bm{\hat{\theta}}_{t}=[\bm{\hat{\theta}}_{t}^{0},\bm{\hat{\theta}}_{t}^{1},\cdots,\bm{\hat{\theta}}_{t}^{J}]$, where $\bm{\hat{\theta}}_{t}^{i}$ corresponds to the $i^{\text{th}}$ joint, we construct $\bm{\hat{\theta}}_{t}^{\prime}=[\bm{\hat{\theta}}_{t}^{0},\bm{\theta}_{t}^{1},\cdots,\bm{\theta}_{t}^{J}]$, where all joint rotations except the root are replaced with simulated values (without $\widehat{\cdot}$). This amounts to setting the non-root joint goals to identity when computing the goal state: $\bm{s}^{\text{g-Fail}}_{t}\triangleq(\bm{\hat{\theta}}_{t}^{\prime}\ominus\bm{\theta}_{t},\;\bm{\hat{p}}_{t}^{\prime}-\bm{p}_{t},\;\bm{\hat{v}}_{t}^{\prime}-\bm{v}_{t},\;\bm{\hat{\omega}}_{t}^{\prime}-\bm{\omega}_{t},\;\bm{\hat{\theta}}_{t}^{\prime},\;\bm{\hat{p}}_{t}^{\prime})$. $\bm{s}^{\text{g-Fail}}_{t}$ thus collapses from an imitation objective to a point-goal [49] objective, where the only information provided is the relative position and orientation of the target root. When the reference root is too far away ($>5$ m), we normalize $\bm{\hat{p}}_{t}^{\prime}-\bm{p}_{t}$ as $\frac{5\times(\bm{\hat{p}}_{t}^{\prime}-\bm{p}_{t})}{\|\bm{\hat{p}}_{t}^{\prime}-\bm{p}_{t}\|_{2}}$ to clamp the goal position. Once the humanoid is close enough (e.g. $<0.5$ m), the goal switches back to full-motion imitation:

$\bm{s}^{\text{g}}_{t}=\begin{cases}\bm{s}^{\text{g}}_{t}&\|\bm{\hat{p}}^{0}_{t}-\bm{p}^{0}_{t}\|_{2}\leq 0.5\\ \bm{s}^{\text{g-Fail}}_{t}&\text{otherwise}.\end{cases}$ (3)

To create fallen states, we follow ASE [34] and randomly drop the humanoid on the ground at the beginning of the episode. The faraway state can be created by initializing the humanoid $2\sim 5$ meters away from the reference motion. The reward for fail-state recovery consists of the AMP reward $r^{\text{amp}}_{t}$, the point-goal reward $r^{\text{g-point}}_{t}$, and the energy penalty $r^{\text{energy}}_{t}$, computed by the reward function $\bm{\mathcal{R}}^{\text{recover}}$:

$r^{\text{g-recover}}_{t}=\bm{\mathcal{R}}^{\text{recover}}(\bm{s}_{t},\bm{\hat{q}}_{t})=0.5\,r^{\text{g-point}}_{t}+0.5\,r^{\text{amp}}_{t}+0.1\,r^{\text{energy}}_{t}.$ (4)

The point-goal reward is formulated as $r^{\text{g-point}}_{t}=d_{t-1}-d_{t}$, where $d_{t}$ is the distance between the reference root and the simulated root at time step $t$ [49]. For training $\bm{\mathcal{P}}^{(F)}$, we use a handpicked subset of the AMASS dataset, named $\bm{Q}^{\text{loco}}$, which contains mainly walking and running sequences. Learning using only $\bm{Q}^{\text{loco}}$ coaxes the discriminator $\bm{\mathcal{D}}$ and the AMP reward $r^{\text{amp}}_{t}$ to bias toward simple locomotion such as walking and running. We do not initialize a new value function or discriminator while training the primitives, and continuously fine-tune the existing ones.
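A sketch of the goal switching in Eq. 3 together with the root-goal clamping and the recovery reward of Eq. 4; the 0.5 m and 5 m thresholds and the 0.5/0.5/0.1 weights follow the text, while the function signatures are illustrative.

    import numpy as np

    def select_goal_state(goal_imitation, goal_fail, root_ref, root_sim, switch_dist=0.5):
        """Eq. 3: use the full imitation goal only when the roots are close enough."""
        close = np.linalg.norm(root_ref - root_sim) <= switch_dist
        return goal_imitation if close else goal_fail

    def clamp_root_goal(delta_root, max_dist=5.0):
        """Clamp the relative root target when the reference root is farther than 5 m."""
        d = np.linalg.norm(delta_root)
        return delta_root if d <= max_dist else max_dist * delta_root / d

    def recover_reward(d_prev, d_curr, r_amp, r_energy):
        """Eq. 4, with the point-goal term r^{g-point}_t = d_{t-1} - d_t."""
        r_point = d_prev - d_curr
        return 0.5 * r_point + 0.5 * r_amp + 0.1 * r_energy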

Multiplicative Control

Once each primitive has been learned, we obtain $\{\bm{\mathcal{P}}^{(1)}\cdots\bm{\mathcal{P}}^{(K)},\bm{\mathcal{P}}^{(F)}\}$, with each primitive capable of imitating a subset of the dataset $\bm{\hat{Q}}$. In Progressive Networks [36], task switching is performed manually. In motion imitation, however, the boundary between hard and easy sequences is blurred. Thus, we utilize the Multiplicative Control Policy (MCP) [33] and train an additional composer $\bm{\mathcal{C}}$ to dynamically combine the learned primitives. Essentially, we use the pretrained primitives as an informed search space for the composer $\bm{\mathcal{C}}$, and $\bm{\mathcal{C}}$ only needs to select which primitives to activate for imitation. Specifically, our composer $\bm{\mathcal{C}}(\bm{w}^{1:K+1}_{t}|\bm{s}_{t})$ consumes the same input as the primitives and outputs a weight vector $\bm{w}^{1:K+1}_{t}\in\mathbb{R}^{K+1}$ to activate the primitives. Combining our composer and primitives, we obtain PHC's output distribution:

$\pi_{\text{PHC}}(\bm{a}_{t}\mid\bm{s}_{t})=\frac{1}{\mathcal{Z}(\bm{s}_{t})}\prod_{i}^{k}\bm{\mathcal{P}}^{(i)}(\bm{a}^{(i)}_{t}\mid\bm{s}_{t})^{\bm{\mathcal{C}}(\bm{s}_{t})},\qquad\bm{\mathcal{C}}(\bm{s}_{t})\geq 0.$
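Because each primitive outputs a Gaussian with diagonal covariance, the weighted product above has a closed form, following the MCP formulation [33]; the sketch below computes the composed action distribution. The tensor shapes and the non-negativity clamp are assumptions about one possible implementation, not the authors' code.

    import torch

    def compose_mcp(means, stds, weights, eps=1e-8):
        """Weighted product of diagonal Gaussians (MCP [33]).

        means, stds: (K, act_dim) per-primitive Gaussian parameters.
        weights:     (K,) non-negative composer outputs for the current state.
        Returns the mean and std of the composed Gaussian pi_PHC(a_t | s_t)."""
        w = weights.clamp_min(0.0).unsqueeze(-1)          # (K, 1), enforce weights >= 0
        precision = w / (stds ** 2 + eps)                 # per-primitive weighted precision
        comp_var = 1.0 / (precision.sum(dim=0) + eps)     # composed variance
        comp_mean = comp_var * (precision * means).sum(dim=0)
        return comp_mean, comp_var.sqrt()

The composed mean and standard deviation can then be used to sample the action, e.g. via torch.distributions.Normal(comp_mean, comp_std).sample().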