
Perpetual Humanoid Control for Real-time Simulated Avatars

Zhengyi Luo1,2  Jinkun Cao2  Alexander Winkler1  Kris Kitani1,2  Weipeng Xu1

1Reality Labs Research, Meta; 2Carnegie Mellon University

https://zhengyiluo.github.io/PHC/
Abstract

We present a physics-based humanoid controller that achieves high-fidelity motion imitation and fault-tolerant behavior in the presence of noisy input (e.g. pose estimates from video or generated from language) and unexpected falls. Our controller scales up to learning ten thousand motion clips without using any external stabilizing forces and learns to naturally recover from fail-state. Given reference motion, our controller can perpetually control simulated avatars without requiring resets. At its core, we propose the progressive multiplicative control policy (PMCP), which dynamically allocates new network capacity to learn harder and harder motion sequences. PMCP allows efficient scaling for learning from large-scale motion databases and adding new tasks, such as fail-state recovery, without catastrophic forgetting. We demonstrate the effectiveness of our controller by using it to imitate noisy poses from video-based pose estimators and language-based motion generators in a live and real-time multi-person avatar use case.

Figure 1: We propose a motion imitator that can naturally recover from falls and walk to far-away reference motion, perpetually controlling simulated avatars without requiring reset. Left: real-time avatars from video, where the blue humanoid recovers from a fall. Right: Imitating 3 disjoint clips of motion generated from language, where our controller fills in the blank. The color gradient indicates the passage of time.


1 Introduction

Physics-based motion imitation has captured the imagination of vision and graphics communities due to its potential for creating realistic human motion, enabling plausible environmental interactions, and advancing virtual avatar technologies of the future. However, controlling high-degree-of-freedom (DOF) humanoids in simulation presents significant challenges, as they can fall, trip, or deviate from their reference motions, and struggle to recover. For example, controlling simulated humanoids using poses estimated from noisy video observations can often lead humanoids to fall to the ground [50, 51, 22, 24]. These limitations prevent the widespread adoption of physics-based methods, as current control policies cannot handle noisy observations such as video or language.

In order to apply physically simulated humanoids to avatars, the first major challenge is learning a motion imitator (controller) that can faithfully reproduce human-like motion with a high success rate. While reinforcement learning (RL)-based imitation policies have shown promising results, successfully imitating motion from a large dataset, such as AMASS (ten thousand clips, 40 hours of motion), with a single policy has yet to be achieved. Attempts to use larger networks or a mixture of expert policies have met with some success [45, 47], although they have not yet scaled to the largest datasets. Therefore, researchers have resorted to using external forces to help stabilize the humanoid. Residual force control (RFC) [52] has helped to create motion imitators that can mimic up to 97% of the AMASS dataset [22], and has seen successful applications in human pose estimation from video [54, 23, 12] and language-based motion generation [53]. However, the external force compromises physical realism by acting as a “hand of God” that puppets the humanoid, leading to artifacts such as flying and floating. One might argue that, with RFC, the realism of the simulation is compromised, as the model can freely apply a non-physical force on the humanoid.

Another important aspect of controlling simulated humanoids is how to handle noisy input and failure cases. In this work, we consider human poses estimated from video or language input. Especially with video input, artifacts such as floating [53], foot sliding [57], and physically impossible poses are prevalent in popular pose estimation methods due to occlusion, challenging viewpoints and lighting, fast motion, etc. To handle these cases, most physics-based methods resort to resetting the humanoid when a failure condition is triggered [24, 22, 51]. However, resetting successfully requires a high-quality reference pose, which is often difficult to obtain due to the noisy nature of the pose estimates, leading to a vicious cycle of falling and resetting to unreliable poses. Thus, it is important to have a controller that can gracefully handle unexpected falls and noisy input, naturally recover from fail-state, and resume imitation.

In this work, our aim is to create a humanoid controller specifically designed to control real-time virtual avatars, where video observations of a human user are used to control the avatar. We design the Perpetual Humanoid Controller (PHC), a single policy that achieves a high success rate on motion imitation and can naturally recover from fail-state. We propose a progressive multiplicative control policy (PMCP) to learn from motion sequences in the entire AMASS dataset without suffering catastrophic forgetting. By treating harder and harder motion sequences as different “tasks” and gradually allocating new network capacity to learn them, PMCP retains its ability to imitate easier motion clips when learning harder ones. PMCP also allows the controller to learn fail-state recovery tasks without compromising its motion imitation capabilities. Additionally, we adopt the Adversarial Motion Prior (AMP) [35] throughout our pipeline to ensure natural and human-like behavior during fail-state recovery. Furthermore, while most motion imitation methods require estimates of both link position and rotation as input, we show that we can design controllers that require only the link positions. This input can be generated more easily by vision-based 3D keypoint estimators or 3D pose estimates from VR controllers.

To summarize, our contributions are as follows: (1) we propose a Perpetual Humanoid Controller that can successfully imitate 98.9% of the AMASS dataset without applying any external forces; (2) we propose the progressive multiplicative control policy to learn from a large motion dataset without catastrophic forgetting and to unlock additional capabilities such as fail-state recovery; (3) our controller is task-agnostic and compatible with off-the-shelf video-based pose estimators as a drop-in solution. We demonstrate the capabilities of our controller by evaluating it on both Motion Capture (MoCap) data and motion estimated from videos. We also show a live (30 fps) demo of driving perpetually simulated avatars using webcam video as input.

2 Related Works

Physics-based Motion Imitation

Governed by the laws of physics, simulated characters [32, 31, 33, 35, 34, 7, 45, 52, 28, 13, 2, 11, 46, 12] have the distinct advantage of creating natural human motion, human-to-human interaction [20, 48], and human-object interactions [28, 34]. Since most modern physics simulators are not differentiable, training these simulated agents requires RL, which is time-consuming and costly. As a result, most of the work focuses on small-scale use cases such as interactive control based on user input [45, 2, 35, 34], playing sports [48, 20, 28], or other modular tasks (reaching goals [49], dribbling [35], moving around [32], etc.). On the other hand, imitating large-scale motion datasets is a challenging yet fundamental task, as an agent that can imitate reference motion can be easily paired with a motion generator to achieve different tasks. From learning to imitate a single clip [31] to entire datasets [47, 45, 7, 44], motion imitators have demonstrated their impressive ability to imitate reference motion, but are often limited to imitating high-quality MoCap data. Among them, ScaDiver [47] uses a mixture-of-experts policy to scale up to the CMU MoCap dataset and achieves a success rate of around 80% measured by time to failure. Unicon [45] shows qualitative results in imitation and transfer, but does not quantify the imitator’s ability to imitate clips from datasets. MoCapAct [44] first learns single-clip experts on the CMU MoCap dataset, and distills them into a single policy that achieves around 80% of the experts’ performance. The effort closest to ours is UHC [22], which successfully imitates 97% of the AMASS dataset. However, UHC uses residual force control [51], which applies a non-physical force at the root of the humanoid to help balance. Although effective in preventing the humanoid from falling, RFC reduces physical realism and creates artifacts such as floating and swinging, especially when motion sequences become challenging [22, 23]. Compared to UHC, our controller does not utilize any external force.

Fail-state Recovery for Simulated Characters

As simulated characters can easily fall when losing balance, many approaches [39, 51, 34, 42, 7] have been proposed to help recovery. PhysCap [39] uses a floating-base humanoid that does not require balancing. This compromises physical realism, as the humanoid is no longer properly simulated. Egopose [51] designs a fail-safe mechanism that resets the humanoid to the kinematic pose when it is about to fall, leading to potential teleportation behavior in which the humanoid keeps resetting to unreliable kinematic poses. NeuroMoCon [14] utilizes sampling-based control and reruns the sampling process if the humanoid falls. Although effective, this approach does not guarantee success and precludes real-time use cases. Another natural approach is to use an additional recovery policy [7] when the humanoid has deviated from the reference motion. However, since such a recovery policy no longer has access to the reference motion, it produces unnatural behavior, such as high-frequency jitter. To combat this, ASE [34] demonstrates the ability to rise naturally from the ground for a sword-swinging policy. While impressive, in motion imitation the policy not only needs to get up from the ground, but also to go back to tracking the reference motion. In this work, we propose a comprehensive solution to the fail-state recovery problem in motion imitation: our PHC can rise from a fallen state, naturally walk back to the reference motion, and resume imitation.

Progressive Reinforcement Learning

When learning from data containing diverse patterns, catastrophic forgetting [9, 27] is observed when attempting to perform multi-task or transfer learning by fine-tuning. Various approaches [8, 16, 18] have been proposed to combat this phenomenon, such as regularizing the weights of the network [18], learning multiple experts [16], or increasing the capacity using a mixture of experts [56, 38, 47] or multiplicative control [33]. A related paradigm has been studied in transfer learning and domain adaptation as progressive learning [6, 4] or curriculum learning [1]. Recently, progressive reinforcement learning [3] has been proposed to distill skills from multiple expert policies. It aims to find a policy that best matches the action distribution of the experts instead of finding an optimal mix of experts. Progressive Neural Networks (PNN) [36] avoid catastrophic forgetting by freezing the weights of previously learned subnetworks and initializing additional subnetworks to learn new tasks. The experiences from previous subnetworks are forwarded through lateral connections. PNN requires manually choosing which subnetwork to use based on the task, preventing it from being used in motion imitation, since reference motion does not have the concept of task labels.

3 Method

We define the reference pose as $\bm{\hat{q}}_{t}\triangleq(\bm{\hat{\theta}}_{t},\bm{\hat{p}}_{t})$, consisting of the 3D joint rotation $\bm{\hat{\theta}}_{t}\in\mathbb{R}^{J\times 6}$ and position $\bm{\hat{p}}_{t}\in\mathbb{R}^{J\times 3}$ of all $J$ links on the humanoid (we use the 6 DoF rotation representation [55]). From reference poses $\bm{\hat{q}}_{1:T}$, one can compute the reference velocities $\bm{\hat{\dot{q}}}_{1:T}$ through finite differences, where $\bm{\hat{\dot{q}}}_{t}\triangleq(\bm{\hat{\omega}}_{t},\bm{\hat{v}}_{t})$ consists of angular velocities $\bm{\hat{\omega}}_{t}\in\mathbb{R}^{J\times 3}$ and linear velocities $\bm{\hat{v}}_{t}\in\mathbb{R}^{J\times 3}$. We differentiate rotation-based and keypoint-based motion imitation by input: rotation-based imitation relies on reference poses $\bm{\hat{q}}_{1:T}$ (both rotation and keypoints), while keypoint-based imitation only requires 3D keypoints $\bm{\hat{p}}_{1:T}$. As a notation convention, we use $\widetilde{\cdot}$ to represent kinematic quantities (without physics simulation) from pose estimators/keypoint detectors, $\widehat{\cdot}$ to denote ground-truth quantities from Motion Capture (MoCap), and symbols without accents for values from the physics simulation. We use “imitate”, “track”, and “mimic” reference motion interchangeably. In Sec. 3.1, we first set up the preliminaries of our main framework. Sec. 3.2 describes our progressive multiplicative control policy for learning to imitate a large dataset of human motion and for recovering from fail-state. Finally, in Sec. 3.3, we briefly describe how we connect our task-agnostic controller to off-the-shelf video pose estimators and motion generators for real-time use cases.

3.1 Goal Conditioned Motion Imitation with Adversarial Motion Prior

Our controller follows the general framework of goal-conditioned RL (Fig. 3), where a goal-conditioned policy $\pi_{\text{PHC}}$ is tasked to imitate reference motion $\bm{\hat{q}}_{1:T}$ or keypoints $\bm{\hat{p}}_{1:T}$. Similar to prior work [22, 31], we formulate the task as a Markov Decision Process (MDP) defined by the tuple $\mathcal{M}=\langle\mathcal{S},\mathcal{A},\mathcal{T},\mathcal{R},\gamma\rangle$ of states, actions, transition dynamics, reward function, and discount factor. The physics simulation determines the state $\bm{s}_{t}\in\mathcal{S}$ and transition dynamics $\mathcal{T}$, while our policy $\pi_{\text{PHC}}$ computes the per-step action $\bm{a}_{t}\in\mathcal{A}$. Based on the simulation state $\bm{s}_{t}$ and the reference motion $\bm{\hat{q}}_{t}$, the reward function $\mathcal{R}$ computes a reward $r_{t}=\mathcal{R}(\bm{s}_{t},\bm{\hat{q}}_{t})$ as the learning signal for our policy. The policy's goal is to maximize the discounted return $\mathbb{E}\left[\sum_{t=1}^{T}\gamma^{t-1}r_{t}\right]$, and we use proximal policy optimization (PPO) [37] to learn $\pi_{\text{PHC}}$.

Figure 2: Our progressive training procedure trains primitives $\bm{\mathcal{P}}^{(1)},\bm{\mathcal{P}}^{(2)},\cdots,\bm{\mathcal{P}}^{(K)}$ by gradually learning harder and harder sequences. The fail-state recovery primitive $\bm{\mathcal{P}}^{(F)}$ is trained last, on simple locomotion data; a composer is then trained to combine these frozen primitives.
Figure 3: Goal-conditioned RL framework with Adversarial Motion Prior. Each primitive $\bm{\mathcal{P}}^{(k)}$ and the composer $\bm{\mathcal{C}}$ are trained using the same procedure; here we visualize the final product $\pi_{\text{PHC}}$.

State

The simulation state $\bm{s}_{t}\triangleq(\bm{s}^{\text{p}}_{t},\bm{s}^{\text{g}}_{t})$ consists of the humanoid proprioception $\bm{s}^{\text{p}}_{t}$ and the goal state $\bm{s}^{\text{g}}_{t}$. Proprioception $\bm{s}^{\text{p}}_{t}\triangleq(\bm{q}_{t},\bm{\dot{q}}_{t},\bm{\beta})$ contains the 3D body pose $\bm{q}_{t}$, velocity $\bm{\dot{q}}_{t}$, and (optionally) body shape $\bm{\beta}$. When trained with different body shapes, $\bm{\beta}$ contains information about the limb length of each body link [24]. For rotation-based motion imitation, the goal state $\bm{s}^{\text{g}}_{t}$ is defined as the difference between the next-time-step reference quantities and their simulated counterparts:

$\bm{s}^{\text{g-rot}}_{t}\triangleq(\bm{\hat{\theta}}_{t+1}\ominus\bm{\theta}_{t},\;\bm{\hat{p}}_{t+1}-\bm{p}_{t},\;\bm{\hat{v}}_{t+1}-\bm{v}_{t},\;\bm{\hat{\omega}}_{t}-\bm{\omega}_{t},\;\bm{\hat{\theta}}_{t+1},\;\bm{\hat{p}}_{t+1})$

where $\ominus$ calculates the rotation difference. For keypoint-only imitation, the goal state becomes

$\bm{s}^{\text{g-kp}}_{t}\triangleq(\bm{\hat{p}}_{t+1}-\bm{p}_{t},\;\bm{\hat{v}}_{t+1}-\bm{v}_{t},\;\bm{\hat{p}}_{t+1}).$

All of the above quantities in $\bm{s}^{\text{g}}_{t}$ and $\bm{s}^{\text{p}}_{t}$ are normalized with respect to the humanoid's current facing direction and root position [49, 22].
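To make the keypoint goal state concrete, the sketch below computes $\bm{s}^{\text{g-kp}}_{t}$ from simulated and reference keypoints and expresses it in the heading-and-root-normalized frame described above. It is a minimal NumPy illustration only: the quaternion convention, the yaw-removal helper, and the array shapes are assumptions rather than the authors' implementation.

    import numpy as np

    def yaw_inverse_rotation(root_quat):
        """Build a rotation matrix that removes the humanoid's heading (yaw).
        Assumes root_quat is (w, x, y, z) with z as the up axis."""
        w, x, y, z = root_quat
        yaw = np.arctan2(2.0 * (w * z + x * y), 1.0 - 2.0 * (y * y + z * z))
        c, s = np.cos(-yaw), np.sin(-yaw)
        return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

    def keypoint_goal_state(p_sim, v_sim, p_ref_next, v_ref_next, root_pos, root_quat):
        """s^{g-kp}_t = (p_hat_{t+1} - p_t, v_hat_{t+1} - v_t, p_hat_{t+1}),
        expressed in the heading- and root-centered frame.  Keypoint arrays are (J, 3)."""
        R = yaw_inverse_rotation(root_quat)           # (3, 3) heading-removal rotation
        dp = (p_ref_next - p_sim) @ R.T               # position difference, local frame
        dv = (v_ref_next - v_sim) @ R.T               # velocity difference, local frame
        p_ref_local = (p_ref_next - root_pos) @ R.T   # reference keypoints w.r.t. the root
        return np.concatenate([dp, dv, p_ref_local], axis=-1).reshape(-1)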

Reward

Unlike prior motion tracking policies that only use a motion imitation reward, we use the recently proposed Adversarial Motion Prior [35] and include a discriminator reward term throughout our framework. Including the discriminator term helps our controller produce stable and natural motion and is especially crucial for learning natural fail-state recovery behaviors. Specifically, our reward is defined as the sum of a task reward $r^{\text{g}}_{t}$, a style reward $r^{\text{amp}}_{t}$, and an additional energy penalty $r^{\text{energy}}_{t}$ [31]:

$r_{t}=0.5\,r^{\text{g}}_{t}+0.5\,r^{\text{amp}}_{t}+r^{\text{energy}}_{t}.$ (1)

For the discriminator, we use the same observations, loss formulation, and gradient penalty as AMP [35]. The energy penalty is expressed as $-0.0005\cdot\sum_{j\in\text{joints}}\left|\bm{\mu}_{j}\bm{\omega}_{j}\right|^{2}$, where $\bm{\mu}_{j}$ and $\bm{\omega}_{j}$ correspond to the joint torque and joint angular velocity, respectively. The energy penalty [10] regulates the policy and prevents the high-frequency foot jitter that can manifest in a policy trained without external force (see Sec. 4.1). The task reward is defined based on the current training objective, chosen by switching between the reward function for motion imitation, $\bm{\mathcal{R}}^{\text{imitation}}$, and that for fail-state recovery, $\bm{\mathcal{R}}^{\text{recover}}$. For motion tracking, we use:

$r^{\text{g-imitation}}_{t}=\bm{\mathcal{R}}^{\text{imitation}}(\bm{s}_{t},\bm{\hat{q}}_{t})=w_{\text{jp}}e^{-100\|\bm{\hat{p}}_{t}-\bm{p}_{t}\|}+w_{\text{jr}}e^{-10\|\bm{\hat{q}}_{t}\ominus\bm{q}_{t}\|}+w_{\text{jv}}e^{-0.1\|\bm{\hat{v}}_{t}-\bm{v}_{t}\|}+w_{\text{j}\omega}e^{-0.1\|\bm{\hat{\omega}}_{t}-\bm{\omega}_{t}\|}$ (2)

where we measure the difference between the translation, rotation, linear velocity, and angular velocity of the rigid bodies for all links in the humanoid. For fail-state recovery, we define the reward $r^{\text{g-recover}}_{t}$ in Eq. 4.
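As a concrete reading of Eq. 1 and Eq. 2, the sketch below evaluates the imitation task reward and combines it with the style reward and energy penalty. The weight values and the scalar error aggregation are placeholders; only the coefficients inside the exponentials, the energy scale, and the 0.5/0.5 mixing follow the text.

    import numpy as np

    def imitation_reward(p_ref, p_sim, rot_err, v_ref, v_sim, w_ref, w_sim,
                         w_jp=0.5, w_jr=0.3, w_jv=0.1, w_jw=0.1):
        """Eq. 2: exponentiated tracking errors over all links.
        rot_err is the per-link rotation difference ||q_hat_t (-) q_t||."""
        e_p = np.linalg.norm(p_ref - p_sim)   # translation error
        e_v = np.linalg.norm(v_ref - v_sim)   # linear-velocity error
        e_w = np.linalg.norm(w_ref - w_sim)   # angular-velocity error
        return (w_jp * np.exp(-100.0 * e_p) + w_jr * np.exp(-10.0 * rot_err)
                + w_jv * np.exp(-0.1 * e_v) + w_jw * np.exp(-0.1 * e_w))

    def energy_penalty(torque, joint_ang_vel):
        """-0.0005 * sum_j |tau_j * omega_j|^2 over all actuated joints."""
        return -0.0005 * np.sum(np.abs(torque * joint_ang_vel) ** 2)

    def total_reward(r_task, r_amp, r_energy):
        """Eq. 1: equally weighted task and style rewards plus the energy term."""
        return 0.5 * r_task + 0.5 * r_amp + r_energy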

Action

We use a proportional derivative (PD) controller at each DoF of the humanoid, and the action $\bm{a}_{t}$ specifies the PD target. With the target joint set as $\bm{q}^{d}_{t}=\bm{a}_{t}$, the torque applied at each joint is $\bm{\tau}^{i}=\bm{k}^{p}\circ(\bm{a}_{t}-\bm{q}_{t})-\bm{k}^{d}\circ\bm{\dot{q}}_{t}$. Notice that this is different from the residual action representation [52, 22, 30] used in prior motion imitation methods, where the action is added to the reference pose ($\bm{q}^{d}_{t}=\bm{\hat{q}}_{t}+\bm{a}_{t}$) to speed up training. As our PHC needs to remain robust to noisy and ill-posed reference motion, we remove such a dependency on the reference motion from our action space. We do not use any external forces [52] or meta-PD control [54].
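A minimal sketch of this non-residual PD action mapping, assuming per-joint gain vectors kp and kd; because the PD target is the raw action rather than reference-plus-residual, a corrupted reference pose cannot directly drag the targets off balance.

    import numpy as np

    def pd_torque(action, joint_pos, joint_vel, kp, kd):
        """tau = kp * (a_t - q_t) - kd * qdot_t, applied element-wise per DoF.
        The PD target is the action itself (no residual on the reference pose)."""
        return kp * (action - joint_pos) - kd * joint_vel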

Control Policy and Discriminator

Our control policy $\pi_{\text{PHC}}(\bm{a}_{t}|\bm{s}_{t})=\mathcal{N}(\mu(\bm{s}_{t}),\sigma)$ represents a Gaussian distribution with fixed diagonal covariance. The AMP discriminator $\bm{\mathcal{D}}(\bm{s}^{\text{p}}_{t-10:t})$ computes a real/fake score based on the humanoid's recent proprioception. All of our networks (discriminator, primitives, value function, and composer) are two-layer multilayer perceptrons (MLPs) with dimensions [1024, 512].
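For reference, a minimal PyTorch sketch of a fixed-variance Gaussian policy with the [1024, 512] MLP backbone mentioned above; the observation/action dimensions, the activation choice, and the initial log-std value are assumptions, not the authors' settings.

    import torch
    import torch.nn as nn

    class GaussianMLPPolicy(nn.Module):
        """Two-layer MLP mean network with a state-independent diagonal std."""
        def __init__(self, obs_dim, act_dim, init_log_std=-2.9):
            super().__init__()
            self.mu = nn.Sequential(
                nn.Linear(obs_dim, 1024), nn.SiLU(),
                nn.Linear(1024, 512), nn.SiLU(),
                nn.Linear(512, act_dim),
            )
            # Fixed diagonal covariance: the log-std is a constant, not learned per state.
            self.log_std = nn.Parameter(torch.full((act_dim,), init_log_std),
                                        requires_grad=False)

        def forward(self, obs):
            mean = self.mu(obs)
            return torch.distributions.Normal(mean, self.log_std.exp())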

Humanoid

Our humanoid controller can support any human kinematic structure, and we use the SMPL [21] kinematic structure following prior art [54, 22, 23]. The SMPL body contains 24 rigid bodies, of which 23 are actuated, resulting in an action space of $\bm{a}_{t}\in\mathbb{R}^{23\times 3}$. The body proportions can vary based on a body shape parameter $\beta\in\mathbb{R}^{10}$.

Initialization and Relaxed Early Termination

We use reference state initialization (RSI) [31] during training and randomly select a starting point within a motion clip for imitation. For early termination, we follow UHC [22] and terminate the episode when the joints are, on average, more than 0.5 meters away from the reference motion in global space. Unlike UHC, we remove the ankle and toe joints from the termination condition. As observed by RFC [52], there exists a dynamics mismatch between simulated humanoids and real humans, especially since the real human foot is multi-segment [29]. Thus, it is not possible for the simulated humanoid to have exactly the same foot movement as MoCap, and blindly following the reference foot movement may lead to the humanoid losing balance. We therefore propose Relaxed Early Termination (RET), which allows the humanoid's ankles and toes to deviate slightly from the MoCap motion to remain balanced. Notice that the humanoid still receives imitation and discriminator rewards for these body parts, which prevents these joints from moving in a non-human manner. We show that, though a small detail, RET is conducive to achieving a high motion imitation success rate.
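A sketch of the relaxed early-termination check, assuming per-link global positions of shape (J, 3) and a hypothetical list of ankle/toe link indices to exclude:

    import numpy as np

    # Hypothetical indices of the ankle and toe links in the SMPL body ordering.
    FOOT_LINKS = [7, 8, 10, 11]

    def should_terminate(p_sim, p_ref, threshold=0.5):
        """Terminate when the mean global link error exceeds `threshold` meters,
        ignoring ankles and toes (Relaxed Early Termination)."""
        keep = [j for j in range(p_sim.shape[0]) if j not in FOOT_LINKS]
        err = np.linalg.norm(p_sim[keep] - p_ref[keep], axis=-1)  # per-link error
        return err.mean() > threshold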

Hard Negative Mining

When learning from a large motion dataset, it is essential to train on harder sequences in the later stages of training to gather more informative experiences. We use a hard negative mining procedure similar to UHC [22] and define hard sequences by whether or not our controller can successfully imitate them. From a motion dataset $\bm{\hat{Q}}$, we find hard sequences $\bm{\hat{Q}}_{\text{hard}}\subseteq\bm{\hat{Q}}$ by evaluating our model over the entire dataset and selecting the sequences that our policy fails to imitate.
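The mining step amounts to a full-dataset evaluation pass; a sketch is given below, where imitate() is a hypothetical stand-in for rolling out the current policy on one clip and reporting success.

    def mine_hard_sequences(policy, dataset, imitate):
        """Return the subset of clips the current policy fails to track.
        `imitate(policy, clip)` rolls out the policy on the clip and returns
        True on success (i.e., no early termination)."""
        return [clip for clip in dataset if not imitate(policy, clip)]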

3.2 Progressive Multiplicative Control Policy

As training continues, we notice that the performance of the model plateaus as it forgets older sequences when learning new ones. Hard negative mining alleviates the problem to a certain extent, yet suffers from the same issue. Introducing new tasks, such as fail-state recovery, may further degrade imitation performance due to catastrophic forgetting. These effects are more concretely categorized in the Appendix (App. C). Thus, we propose the progressive multiplicative control policy (PMCP), which allocates new subnetworks (primitives $\bm{\mathcal{P}}$) to learn harder sequences.

Progressive Neural Networks (PNN)

A PNN [36] starts with a single primitive network $\bm{\mathcal{P}}^{(1)}$ trained on the full dataset $\bm{\hat{Q}}$. Once $\bm{\mathcal{P}}^{(1)}$ is trained to convergence on the entire motion dataset $\bm{\hat{Q}}$ using the imitation task, we create a subset of hard motions by evaluating $\bm{\mathcal{P}}^{(1)}$ on $\bm{\hat{Q}}$. We define convergence as the point at which the success rate on $\bm{\hat{Q}}_{\text{hard}}^{(k)}$ no longer increases. The sequences that $\bm{\mathcal{P}}^{(1)}$ fails on form $\bm{\hat{Q}}_{\text{hard}}^{(1)}$. We then freeze the parameters of $\bm{\mathcal{P}}^{(1)}$ and create a new primitive $\bm{\mathcal{P}}^{(2)}$ (randomly initialized), along with lateral connections that connect each layer of $\bm{\mathcal{P}}^{(1)}$ to $\bm{\mathcal{P}}^{(2)}$. For more information about PNN, please refer to our supplementary material. During training, we construct each $\bm{\hat{Q}}_{\text{hard}}^{(k)}$ by selecting the failed sequences from the previous step $\bm{\hat{Q}}_{\text{hard}}^{(k-1)}$, resulting in smaller and smaller hard subsets: $\bm{\hat{Q}}_{\text{hard}}^{(k)}\subseteq\bm{\hat{Q}}_{\text{hard}}^{(k-1)}$. In this way, we ensure that each newly initialized primitive $\bm{\mathcal{P}}^{(k)}$ is responsible for learning a new and harder subset of motion sequences, as shown in Fig. 2. Notice that this is different from hard-negative mining in UHC [22], as we initialize a new primitive $\bm{\mathcal{P}}^{(k+1)}$ to train. Since the original PNN was proposed to solve completely new tasks (such as different Atari games), its lateral connection mechanism allows later tasks to choose between reusing, modifying, or discarding prior experiences. However, mimicking human motion is highly correlated across sequences, so fitting to harder sequences $\bm{\hat{Q}}_{\text{hard}}^{(k)}$ can effectively draw on previous motor control experience. Thus, we also consider a variant of PNN with no lateral connections, where each new primitive is instead initialized from the weights of the previous one. This weight-sharing scheme is similar to fine-tuning on the harder motion sequences using a new primitive $\bm{\mathcal{P}}^{(k+1)}$ while preserving $\bm{\mathcal{P}}^{(k)}$'s ability to imitate the already-learned sequences.
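A minimal sketch of the weight-sharing variant described above: the current primitive is frozen, and the next primitive is initialized from its weights and trained on the remaining hard subset. The lateral connections of the original PNN are omitted, and the class and method names are illustrative only.

    import copy
    import torch.nn as nn

    class PrimitiveStack(nn.Module):
        """Keeps frozen primitives P^(1..k-1) and one trainable primitive P^(k)."""
        def __init__(self, make_primitive):
            super().__init__()
            self.frozen = nn.ModuleList()
            self.active = make_primitive()          # P^(1), randomly initialized

        def grow(self):
            """Freeze the current primitive and spawn P^(k+1) from its weights."""
            for p in self.active.parameters():
                p.requires_grad_(False)
            self.frozen.append(self.active)
            self.active = copy.deepcopy(self.frozen[-1])  # weight-sharing initialization
            for p in self.active.parameters():
                p.requires_grad_(True)

        def primitives(self):
            return list(self.frozen) + [self.active]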

Function TrainPPO($\pi$, $\bm{\hat{Q}}^{(k)}$, $\bm{\mathcal{D}}$, $\bm{\mathcal{V}}$, $\bm{\mathcal{R}}$):
    while not converged do
        $\bm{M}\leftarrow\emptyset$  // initialize sampling memory
        while $\bm{M}$ not full do
            $\bm{\hat{q}}_{1:T}\leftarrow$ sample motion from $\bm{\hat{Q}}^{(k)}$
            for $t\leftarrow 1\ldots T$ do
                $\bm{s}_{t}\leftarrow(\bm{s}^{\text{p}}_{t},\bm{s}^{\text{g}}_{t})$
                $\bm{a}_{t}\leftarrow\pi(\bm{a}_{t}|\bm{s}_{t})$
                $\bm{s}_{t+1}\leftarrow\mathcal{T}(\bm{s}_{t+1}|\bm{s}_{t},\bm{a}_{t})$  // simulation
                $r_{t}\leftarrow\bm{\mathcal{R}}(\bm{s}_{t},\bm{\hat{q}}_{t+1})$
                store $(\bm{s}_{t},\bm{a}_{t},r_{t},\bm{s}_{t+1})$ into memory $\bm{M}$
        $\pi,\bm{\mathcal{V}}\leftarrow$ PPO update using experiences collected in $\bm{M}$
        $\bm{\mathcal{D}}\leftarrow$ discriminator update using experiences collected in $\bm{M}$
    return $\pi$

Input: ground-truth motion dataset $\bm{\hat{Q}}$
Initialize $\bm{\mathcal{D}}$, $\bm{\mathcal{V}}$; $\bm{\hat{Q}}_{\text{hard}}^{(1)}\leftarrow\bm{\hat{Q}}$  // discriminator, value function, and dataset
for $k\leftarrow 1\ldots K$ do
    Initialize $\bm{\mathcal{P}}^{(k)}$  // lateral connection / weight sharing
    $\bm{\mathcal{P}}^{(k)}\leftarrow$ TrainPPO($\bm{\mathcal{P}}^{(k)}$, $\bm{\hat{Q}}_{\text{hard}}^{(k)}$, $\bm{\mathcal{D}}$, $\bm{\mathcal{V}}$, $\bm{\mathcal{R}}^{\text{imitation}}$)
    $\bm{\hat{Q}}_{\text{hard}}^{(k+1)}\leftarrow$ eval($\bm{\mathcal{P}}^{(k)}$, $\bm{\hat{Q}}_{\text{hard}}^{(k)}$)
    $\bm{\mathcal{P}}^{(k)}\leftarrow$ freeze $\bm{\mathcal{P}}^{(k)}$
$\bm{\mathcal{P}}^{(F)}\leftarrow$ TrainPPO($\bm{\mathcal{P}}^{(F)}$, $\bm{Q}^{\text{loco}}$, $\bm{\mathcal{D}}$, $\bm{\mathcal{V}}$, $\bm{\mathcal{R}}^{\text{recover}}$)  // fail-state recovery
$\pi_{\text{PHC}}\leftarrow\{\bm{\mathcal{P}}^{(1)}\cdots\bm{\mathcal{P}}^{(K)},\bm{\mathcal{P}}^{(F)},\bm{\mathcal{C}}\}$
$\pi_{\text{PHC}}\leftarrow$ TrainPPO($\pi_{\text{PHC}}$, $\bm{\hat{Q}}$, $\bm{\mathcal{D}}$, $\bm{\mathcal{V}}$, $\{\bm{\mathcal{R}}^{\text{imitation}},\bm{\mathcal{R}}^{\text{recover}}\}$)  // train composer

Algorithm 1: Learn Progressive Multiplicative Control Policy
Figure 4: (a) Imitating high-quality MoCap – spin and kick. (b) Recovering from a fallen state and going back to the reference motion (indicated by red dots). (c) Imitating noisy motion estimated from video. (d) Imitating motion generated from language. (e) Using poses estimated from a webcam stream for a real-time simulated avatar.

Fail-state Recovery

In addition to learning harder sequences, we also learn new tasks, such as recovering from fail-state. We define three types of fail-state: 1) fallen on the ground; 2) far away from the reference motion ($>0.5$ m); 3) their combination: fallen and far away. In these situations, the humanoid should get up from the ground, approach the reference motion in a natural way, and resume motion imitation. For this new task, we initialize a primitive $\bm{\mathcal{P}}^{(F)}$ at the end of the primitive stack. $\bm{\mathcal{P}}^{(F)}$ shares the same input and output space as $\bm{\mathcal{P}}^{(1)}\cdots\bm{\mathcal{P}}^{(K)}$, but since the reference motion does not provide useful information about fail-state recovery (the humanoid should not attempt to imitate the reference motion when lying on the ground), we modify the state space during fail-state recovery to remove all information about the reference motion except the root. For the reference joint rotations $\bm{\hat{\theta}}_{t}=[\bm{\hat{\theta}}_{t}^{0},\bm{\hat{\theta}}_{t}^{1},\cdots,\bm{\hat{\theta}}_{t}^{J}]$, where $\bm{\hat{\theta}}_{t}^{i}$ corresponds to the $i^{\text{th}}$ joint, we construct $\bm{\hat{\theta}}_{t}^{\prime}=[\bm{\hat{\theta}}_{t}^{0},\bm{\theta}_{t}^{1},\cdots,\bm{\theta}_{t}^{J}]$, where all joint rotations except the root are replaced with simulated values (without $\widehat{\cdot}$). This amounts to setting the non-root joint goals to identity when computing the goal state: $\bm{s}^{\text{g-Fail}}_{t}\triangleq(\bm{\hat{\theta}}_{t}^{\prime}\ominus\bm{\theta}_{t},\;\bm{\hat{p}}_{t}^{\prime}-\bm{p}_{t},\;\bm{\hat{v}}_{t}^{\prime}-\bm{v}_{t},\;\bm{\hat{\omega}}_{t}^{\prime}-\bm{\omega}_{t},\;\bm{\hat{\theta}}_{t}^{\prime},\;\bm{\hat{p}}_{t}^{\prime})$. $\bm{s}^{\text{g-Fail}}_{t}$ thus collapses from an imitation objective to a point-goal [49] objective, where the only information provided is the relative position and orientation of the target root. When the reference root is too far away ($>5$ m), we normalize $\bm{\hat{p}}_{t}^{\prime}-\bm{p}_{t}$ as $\frac{5\times(\bm{\hat{p}}_{t}^{\prime}-\bm{p}_{t})}{\|\bm{\hat{p}}_{t}^{\prime}-\bm{p}_{t}\|_{2}}$ to clamp the goal position. Once the humanoid is close enough (e.g. $<0.5$ m), the goal switches back to full-motion imitation:

$\bm{s}^{\text{g}}_{t}=\begin{cases}\bm{s}^{\text{g}}_{t}&\|\bm{\hat{p}}^{0}_{t}-\bm{p}^{0}_{t}\|_{2}\leq 0.5\\ \bm{s}^{\text{g-Fail}}_{t}&\text{otherwise}.\end{cases}$ (3)

To create fallen states, we follow ASE [34] and randomly drop the humanoid on the ground at the beginning of the episode. The faraway state can be created by initializing the humanoid $2\sim 5$ meters away from the reference motion. The reward for fail-state recovery consists of the AMP reward $r^{\text{amp}}_{t}$, the point-goal reward $r^{\text{g-point}}_{t}$, and the energy penalty $r^{\text{energy}}_{t}$, computed by the reward function $\bm{\mathcal{R}}^{\text{recover}}$:

$r^{\text{g-recover}}_{t}=\bm{\mathcal{R}}^{\text{recover}}(\bm{s}_{t},\bm{\hat{q}}_{t})=0.5\,r^{\text{g-point}}_{t}+0.5\,r^{\text{amp}}_{t}+0.1\,r^{\text{energy}}_{t}.$ (4)

The point-goal reward is formulated as $r^{\text{g-point}}_{t}=d_{t-1}-d_{t}$, where $d_{t}$ is the distance between the reference root and the simulated root at time step $t$ [49]. For training $\bm{\mathcal{P}}^{(F)}$, we use a handpicked subset of the AMASS dataset, named $\bm{Q}^{\text{loco}}$, which contains mainly walking and running sequences. Learning using only $\bm{Q}^{\text{loco}}$ coaxes the discriminator $\bm{\mathcal{D}}$ and the AMP reward $r^{\text{amp}}_{t}$ to bias toward simple locomotion such as walking and running. We do not initialize a new value function or discriminator while training the primitives, and continuously fine-tune the existing ones.
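A sketch of the goal switching in Eq. 3 together with the root-goal clamping and the recovery reward of Eq. 4; the 0.5 m and 5 m thresholds and the 0.5/0.5/0.1 weights follow the text, while the function signatures are illustrative.

    import numpy as np

    def select_goal_state(goal_imitation, goal_fail, root_ref, root_sim, switch_dist=0.5):
        """Eq. 3: use the full imitation goal only when the roots are close enough."""
        close = np.linalg.norm(root_ref - root_sim) <= switch_dist
        return goal_imitation if close else goal_fail

    def clamp_root_goal(delta_root, max_dist=5.0):
        """Clamp the relative root target when the reference root is farther than 5 m."""
        d = np.linalg.norm(delta_root)
        return delta_root if d <= max_dist else max_dist * delta_root / d

    def recover_reward(d_prev, d_curr, r_amp, r_energy):
        """Eq. 4, with the point-goal term r^{g-point}_t = d_{t-1} - d_t."""
        r_point = d_prev - d_curr
        return 0.5 * r_point + 0.5 * r_amp + 0.1 * r_energy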

Multiplicative Control

Once each primitive has been learned, we obtain $\{\bm{\mathcal{P}}^{(1)}\cdots\bm{\mathcal{P}}^{(K)},\bm{\mathcal{P}}^{(F)}\}$, with each primitive capable of imitating a subset of the dataset $\bm{\hat{Q}}$. In Progressive Networks [36], task switching is performed manually. In motion imitation, however, the boundary between hard and easy sequences is blurred. Thus, we utilize the Multiplicative Control Policy (MCP) [33] and train an additional composer $\bm{\mathcal{C}}$ to dynamically combine the learned primitives. Essentially, we use the pretrained primitives as an informed search space for the composer $\bm{\mathcal{C}}$, and $\bm{\mathcal{C}}$ only needs to select which primitives to activate for imitation. Specifically, our composer $\bm{\mathcal{C}}(\bm{w}^{1:K+1}_{t}|\bm{s}_{t})$ consumes the same input as the primitives and outputs a weight vector $\bm{w}^{1:K+1}_{t}\in\mathbb{R}^{K+1}$ to activate the primitives. Combining our composer and primitives, we obtain PHC's output distribution:

$\pi_{\text{PHC}}(\bm{a}_{t}\mid\bm{s}_{t})=\frac{1}{\mathcal{Z}(\bm{s}_{t})}\prod_{i}^{k}\bm{\mathcal{P}}^{(i)}(\bm{a}^{(i)}_{t}\mid\bm{s}_{t})^{\bm{\mathcal{C}}(\bm{s}_{t})},\qquad\bm{\mathcal{C}}(\bm{s}_{t})\geq 0.$
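Because each primitive outputs a Gaussian with diagonal covariance, the weighted product above has a closed form, following the MCP formulation [33]; the sketch below computes the composed action distribution. The tensor shapes and the non-negativity clamp are assumptions about one possible implementation, not the authors' code.

    import torch

    def compose_mcp(means, stds, weights, eps=1e-8):
        """Weighted product of diagonal Gaussians (MCP [33]).

        means, stds: (K, act_dim) per-primitive Gaussian parameters.
        weights:     (K,) non-negative composer outputs for the current state.
        Returns the mean and std of the composed Gaussian pi_PHC(a_t | s_t)."""
        w = weights.clamp_min(0.0).unsqueeze(-1)          # (K, 1), enforce weights >= 0
        precision = w / (stds ** 2 + eps)                 # per-primitive weighted precision
        comp_var = 1.0 / (precision.sum(dim=0) + eps)     # composed variance
        comp_mean = comp_var * (precision * means).sum(dim=0)
        return comp_mean, comp_var.sqrt()

The composed mean and standard deviation can then be used to sample the action, e.g. via torch.distributions.Normal(comp_mean, comp_std).sample().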