For sophisticated reinforcement learning (RL) systems to interact usefully with real-world environments, we need to communicate complex goals to these systems. In this work, we explore goals defined in terms of (non-expert) human preferences between pairs of trajectory segments. We show that this approach can effectively solve complex RL tasks without access to the reward function, including Atari games and simulated robot locomotion, while providing feedback on less than 1% of our agent's interactions with the environment. This reduces the cost of human oversight far enough that it can be practically applied to state-of-the-art RL systems. To demonstrate the flexibility of our approach, we show that we can successfully train complex novel behaviors with about an hour of human time. These behaviors and environments are considerably more complex than any which have been previously learned from human feedback.
1 Introduction
Recent success in scaling reinforcement learning (RL) to large problems has been driven in domains that have a well-specified reward function (Mnih et al., 2015, 2016; Silver et al., 2016). Unfortunately, many tasks involve goals that are complex, poorly-defined, or hard to specify. Overcoming this limitation would greatly expand the possible impact of deep RL and could increase the reach of machine learning more broadly.
For example, suppose that we wanted to use reinforcement learning to train a robot to clean a table or scramble an egg. It's not clear how to construct a suitable reward function, which will need to be a function of the robot's sensors. We could try to design a simple reward function that approximately captures the intended behavior, but this will often result in behavior that optimizes our reward function without actually satisfying our preferences. This difficulty underlies recent concerns about misalignment between our values and the objectives of our RL systems (Bostrom, 2014; Russell, 2016; Amodei et al., 2016). If we could successfully communicate our actual objectives to our agents, it would be a significant step towards addressing these concerns.
If we have demonstrations of the desired task, we can extract a reward function using inverse reinforcement learning (Ng and Russell, 2000). This reward function can then be used to train an agent with reinforcement learning. More directly, we can use imitation learning to clone the demonstrated behavior. However, these approaches are not directly applicable to behaviors that are difficult for humans to demonstrate (such as controlling a robot with many degrees of freedom but very non-human morphology).
An alternative approach is to allow a human to provide feedback on our system's current behavior and to use this feedback to define the task. In principle this fits within the paradigm of reinforcement learning, but using human feedback directly as a reward function is prohibitively expensive for RL systems that require hundreds or thousands of hours of experience. In order to practically train deep RL systems with human feedback, we need to decrease the amount of feedback required by several orders of magnitude.
Our approach is to learn a reward function from human feedback and then to optimize that reward function. This basic approach has been considered previously, but we confront the challenges involved in scaling it up to modern deep RL and demonstrate by far the most complex behaviors yet learned from human feedback.
In summary, we desire a solution to sequential decision problems without a well-specified reward function that
enables us to solve tasks for which we can only recognize the desired behavior, but not necessarily demonstrate it,
allows agents to be taught by non-expert users,
scales to large problems, and
is economical with user feedback.
Our algorithm fits a reward function to the human's preferences while simultaneously training a policy to optimize the current predicted reward function (see Figure 1). We ask the human to compare short video clips of the agent's behavior, rather than to supply an absolute numerical score. We found comparisons to be easier for humans to provide in some domains, while being equally useful for learning human preferences. Comparing short video clips is nearly as fast as comparing individual states, but we show that the resulting comparisons are significantly more helpful. Moreover, we show that collecting feedback online improves the system's performance and prevents it from exploiting weaknesses of the learned reward function.
Figure 1: Schematic illustration of our approach: the reward predictor is trained asynchronously from comparisons of trajectory segments, and the agent maximizes predicted reward.
Our experiments take place in two domains: Atari games in the Arcade Learning Environment (Bellemare et al., 2013), and robotics tasks in the physics simulator MuJoCo (Todorov et al., 2012). We show that a small amount of feedback from a non-expert human, ranging from fifteen minutes to five hours, suffices to learn most of the original RL tasks even when the reward function is not observable. We then consider some novel behaviors in each domain, such as performing a backflip or driving with the flow of traffic. We show that our algorithm can learn these behaviors from about an hour of feedback, even though it is unclear how to hand-engineer a reward function that would incentivize them.
1.1 Related Work
A long line of work studies reinforcement learning from human ratings or rankings, including Akrour et al. (2011), Pilarski et al. (2011), Akrour et al. (2012), Wilson et al. (2012), Sugiyama et al. (2012), Wirth and Fürnkranz (2013), Daniel et al. (2015), El Asri et al. (2016), Wang et al. (2016), and Wirth et al. (2016). Other lines of research consider the general problem of reinforcement learning from preferences rather than absolute reward values (Fürnkranz et al., 2012; Akrour et al., 2014), and optimizing using human preferences in settings other than reinforcement learning (Machwe and Parmee, 2006; Secretan et al., 2008; Brochu et al., 2010; Sørensen et al., 2016).
Our algorithm follows the same basic approach as Akrour et al. (2012) and Akrour et al. (2014). They consider continuous domains with four degrees of freedom and small discrete domains, where they can assume that the reward is linear in the expectations of hand-coded features. We instead consider physics tasks with dozens of degrees of freedom and Atari tasks with no hand-engineered features; the complexity of our environments forces us to use different RL algorithms and reward models, and to cope with different algorithmic tradeoffs. One notable difference is that Akrour et al. (2012) and Akrour et al. (2014) elicit preferences over whole trajectories rather than short clips. So although we gather about two orders of magnitude more comparisons, our experiments require less than one order of magnitude more human time. Other differences focus on changing our training procedure to cope with the nonlinear reward models and modern deep RL, for example using asynchronous training and ensembling.
Our approach to feedback elicitation closely follows Wilson et al. (2012). However, Wilson et al. (2012) assumes that the reward function is the distance to some unknown "target" policy (which is itself a linear function of hand-coded features). They fit this reward function using Bayesian inference, and rather than performing RL they produce trajectories using the MAP estimate of the target policy. Their experiments involve "synthetic" human feedback which is drawn from their Bayesian model, while we perform experiments with feedback gathered from non-expert users. It is not clear if the methods in Wilson et al. (2012) can be extended to complex tasks or if they can work with real human feedback.
MacGlashan et al. (2017), Pilarski et al. (2011), Knox and Stone (2009), and Knox (2012) perform experiments involving reinforcement learning from actual human feedback, although their algorithmic approach is less similar. In MacGlashan et al. (2017) and Pilarski et al. (2011), learning only occurs during episodes where the human trainer provides feedback. This appears to be infeasible in domains like Atari games where thousands of hours of experience are required to learn a high-quality policy, and would be prohibitively expensive even for the simplest tasks we consider. TAMER (Knox, 2012; Knox and Stone, 2013) also learns a reward function; however, they consider much simpler settings where the desired policy can be learned relatively quickly.
Our work could also be seen as a specific instance of the cooperative inverse reinforcement learning framework (Hadfield-Menell et al., 2016). This framework considers a two-player game between a human and a robot interacting with an environment with the purpose of maximizing the human's reward function. In our setting the human is only allowed to interact with this game by stating their preferences.
Compared to all prior work, our key contribution is to scale human feedback up to deep reinforcement learning and to learn much more complex behaviors. This fits into a recent trend of scaling reward learning methods to large deep learning systems, for example inverse RL (Finn et al., 2016), imitation learning (Ho and Ermon, 2016; Stadie et al., 2017), semi-supervised skill generalization (Finn et al., 2017), and bootstrapping RL from demonstrations (Silver et al., 2016; Hester et al., 2017).
2 Preliminaries and Method
2.1 Setting and Goal
We consider an agent interacting with an environment over a sequence of steps; at each time $t$ the agent receives an observation $o_t \in \mathcal{O}$ from the environment and then sends an action $a_t \in \mathcal{A}$ to the environment.
In traditional reinforcement learning, the environment would also supply a reward $r_t \in \mathbb{R}$ and the agent's goal would be to maximize the discounted sum of rewards. Instead of assuming that the environment produces a reward signal, we assume that there is a human overseer who can express preferences between trajectory segments. A trajectory segment is a sequence of observations and actions, $\sigma = ((o_0, a_0), (o_1, a_1), \ldots, (o_{k-1}, a_{k-1})) \in (\mathcal{O} \times \mathcal{A})^k$. Write $\sigma^1 \succ \sigma^2$ to indicate that the human preferred trajectory segment $\sigma^1$ to trajectory segment $\sigma^2$. Informally, the goal of the agent is to produce trajectories which are preferred by the human, while making as few queries as possible to the human.
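For concreteness, the sketch below (illustrative Python, not our implementation) shows one way a trajectory segment of length $k$ might be represented as the data structure that the human is asked to compare.

```python
# A minimal sketch, not the paper's code: a trajectory segment sigma is a
# fixed-length sequence of (observation, action) pairs taken from a rollout.
from dataclasses import dataclass

import numpy as np


@dataclass
class Segment:
    observations: np.ndarray  # shape (k, observation_dim)
    actions: np.ndarray       # shape (k, action_dim)

    def __len__(self) -> int:
        return self.observations.shape[0]  # the segment length k
```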
More precisely, we will evaluate our algorithms' behavior in two ways:
Quantitative: We say that preferences $\succ$ are generated by a reward function$^1$ $r: \mathcal{O} \times \mathcal{A} \rightarrow \mathbb{R}$ if
$$\left(\left(o_{0}^{1}, a_{0}^{1}\right), \ldots,\left(o_{k-1}^{1}, a_{k-1}^{1}\right)\right) \succ \left(\left(o_{0}^{2}, a_{0}^{2}\right), \ldots,\left(o_{k-1}^{2}, a_{k-1}^{2}\right)\right)$$
whenever
$$r\left(o_{0}^{1}, a_{0}^{1}\right)+\cdots+r\left(o_{k-1}^{1}, a_{k-1}^{1}\right) > r\left(o_{0}^{2}, a_{0}^{2}\right)+\cdots+r\left(o_{k-1}^{2}, a_{k-1}^{2}\right).$$
If the human's preferences are generated by a reward function $r$, then our agent ought to receive a high total reward according to $r$. So if we know the reward function $r$, we can evaluate the agent quantitatively. Ideally the agent will achieve reward nearly as high as if it had been using RL to optimize $r$.
Qualitative: Sometimes we have no reward function by which we can quantitatively evaluate behavior (this is the situation where our approach would be practically useful). In these cases, all we can do is qualitatively evaluate how well the agent satisfies the human's preferences. In this paper, we will start from a goal expressed in natural language, ask a human to evaluate the agent's behavior based on how well it fulfills that goal, and then present videos of agents attempting to fulfill that goal.
Our model based on trajectory segment comparisons is very similar to the trajectory preference queries used in Wilson et al. (2012), except that we don't assume that we can reset the system to an arbitrary state$^2$ and so our segments generally begin from different states. This complicates the interpretation of human comparisons, but we show that our algorithm overcomes this difficulty even when the human raters have no understanding of our algorithm.
2.2 Our Method
At each point in time our method maintains a policy $\pi: \mathcal{O} \rightarrow \mathcal{A}$ and a reward function estimate $\hat{r}: \mathcal{O} \times \mathcal{A} \rightarrow \mathbb{R}$, each parametrized by deep neural networks.
These networks are updated by three processes:
1. The policy $\pi$ interacts with the environment to produce a set of trajectories $\{\tau^1, \ldots, \tau^i\}$. The parameters of $\pi$ are updated by a traditional reinforcement learning algorithm, in order to maximize the sum of the predicted rewards $r_t = \hat{r}(o_t, a_t)$.
2. We select pairs of segments $(\sigma^1, \sigma^2)$ from the trajectories $\{\tau^1, \ldots, \tau^i\}$ produced in step 1, and send them to a human for comparison.
3. The parameters of the mapping $\hat{r}$ are optimized via supervised learning to fit the comparisons collected from the human so far.
These processes run asynchronously, with trajectories flowing from process (1) to process (2), human comparisons flowing from process (2) to process (3), and parameters for $\hat{r}$ flowing from process (3) to process (1). The following subsections provide details on each of these processes.
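As a rough illustration of how these three processes fit together, the sketch below interleaves them in a single loop; the helper callables are hypothetical placeholders, and in our actual system the processes run asynchronously rather than sequentially.

```python
# A schematic sketch (not our implementation) of the three processes from
# Figure 1, interleaved in one loop. Every helper passed in is a hypothetical
# placeholder: any standard RL update and any labeling interface would do.
def training_loop(collect_rollouts,      # (policy, reward_model) -> trajectories
                  update_policy,         # RL step maximizing predicted reward
                  select_segment_pairs,  # picks pairs of short clips to label
                  ask_human,             # (seg1, seg2) -> 1, 2, "equal", "skip"
                  fit_reward_model,      # supervised fit of r_hat to comparisons
                  policy, reward_model, num_iterations):
    comparison_db = []  # the database D of (segment_1, segment_2, mu) triples
    for _ in range(num_iterations):
        # (1) Generate behavior and improve the policy against r_hat.
        trajectories = collect_rollouts(policy, reward_model)
        update_policy(policy, reward_model, trajectories)
        # (2) Elicit human preferences over selected pairs of segments.
        for seg1, seg2 in select_segment_pairs(trajectories):
            label = ask_human(seg1, seg2)
            if label != "skip":          # incomparable pairs are discarded
                comparison_db.append((seg1, seg2, label))
        # (3) Refit the reward predictor on all comparisons gathered so far.
        fit_reward_model(reward_model, comparison_db)
    return policy, reward_model
```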
2.2.1 Optimizing the Policy
After using $\hat{r}$ to compute rewards, we are left with a traditional reinforcement learning problem. We can solve this problem using any RL algorithm that is appropriate for the domain. One subtlety is that the reward function $\hat{r}$ may be non-stationary, which leads us to prefer methods which are robust to changes in the reward function. This led us to focus on policy gradient methods, which have been applied successfully for such problems (Ho and Ermon, 2016).
In this paper, we use advantage actor-critic (A2C; Mnih et al., 2016) to play Atari games, and trust region policy optimization (TRPO; Schulman et al., 2015) to perform simulated robotics tasks. In each case, we used parameter settings which have been found to work well for traditional RL tasks. The only hyperparameter which we adjusted was the entropy bonus for TRPO. This is because TRPO relies on the trust region to ensure adequate exploration, which can lead to inadequate exploration if the reward function is changing.
We normalized the rewards produced by $\hat{r}$ to have zero mean and constant standard deviation. This is a typical preprocessing step which is particularly appropriate here since the position of the rewards is underdetermined by our learning problem.
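As an illustration, such a normalization could be maintained with running statistics over the predicted rewards; the sketch below (our own simplification, with a target standard deviation of 1) is one possible implementation.

```python
import numpy as np


# A minimal sketch of normalizing r_hat outputs to zero mean and a constant
# standard deviation using running statistics (Welford's algorithm). The
# target standard deviation of 1.0 is an illustrative assumption.
class RewardNormalizer:
    def __init__(self, target_std: float = 1.0, eps: float = 1e-8):
        self.count, self.mean, self.m2 = 0, 0.0, 0.0
        self.target_std, self.eps = target_std, eps

    def update(self, rewards) -> None:
        for r in np.asarray(rewards, dtype=np.float64).ravel():
            self.count += 1
            delta = r - self.mean
            self.mean += delta / self.count
            self.m2 += delta * (r - self.mean)

    def normalize(self, rewards):
        std = np.sqrt(self.m2 / max(self.count, 1)) + self.eps
        return (np.asarray(rewards) - self.mean) / std * self.target_std
```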
2.2.2 Preference Elicitation
The human overseer is given a visualization of two trajectory segments, in the form of short movie clips. In all of our experiments, these clips are between 1 and 2 seconds long.
The human then indicates which segment they prefer, that the two segments are equally good, or that they are unable to compare the two segments.
The human judgments are recorded in a database $\mathcal{D}$ of triples $(\sigma^1, \sigma^2, \mu)$, where $\sigma^1$ and $\sigma^2$ are the two segments and $\mu$ is a distribution over $\{1,2\}$ indicating which segment the user preferred. If the human selects one segment as preferable, then $\mu$ puts all of its mass on that choice. If the human marks the segments as equally preferable, then $\mu$ is uniform. Finally, if the human marks the segments as incomparable, then the comparison is not included in the database.
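The sketch below illustrates how a rater's response might be turned into the distribution $\mu$ and stored in $\mathcal{D}$; the response strings are our own illustrative convention rather than the actual labeling interface.

```python
# A minimal sketch: converting a rater's response into the distribution mu
# over {1, 2} and recording it in the database D. The response strings
# ("left", "right", "equal", "skip") are illustrative assumptions.
def record_judgment(database, segment_1, segment_2, response):
    if response == "left":      # segment_1 preferred: all mass on choice 1
        mu = (1.0, 0.0)
    elif response == "right":   # segment_2 preferred: all mass on choice 2
        mu = (0.0, 1.0)
    elif response == "equal":   # equally preferable: uniform distribution
        mu = (0.5, 0.5)
    else:                       # incomparable: the comparison is discarded
        return
    database.append((segment_1, segment_2, mu))
```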
2.2.3 Fitting the Reward Function
We can interpret a reward function estimate $\hat{r}$ as a preference-predictor if we view $\hat{r}$ as a latent factor explaining the human's judgments and assume that the human's probability of preferring a segment $\sigma^i$ depends exponentially on the value of the latent reward summed over the length of the clip$^3$:
$$\hat{P}\left[\sigma^{1} \succ \sigma^{2}\right]=\frac{\exp \sum_{t} \hat{r}\left(o_{t}^{1}, a_{t}^{1}\right)}{\exp \sum_{t} \hat{r}\left(o_{t}^{1}, a_{t}^{1}\right)+\exp \sum_{t} \hat{r}\left(o_{t}^{2}, a_{t}^{2}\right)}. \tag{1}$$
We choose $\hat{r}$ to minimize the cross-entropy loss between these predictions and the actual human labels:
$$\operatorname{loss}(\hat{r})=-\sum_{\left(\sigma^{1}, \sigma^{2}, \mu\right) \in \mathcal{D}} \mu(1) \log \hat{P}\left[\sigma^{1} \succ \sigma^{2}\right]+\mu(2) \log \hat{P}\left[\sigma^{2} \succ \sigma^{1}\right].$$
This follows the Bradley-Terry model (Bradley and Terry, 1952) for estimating score functions from pairwise preferences, and is the specialization of the Luce-Shephard choice rule (Luce, 2005; Shepard, 1957) to preferences over trajectory segments. It can be understood as equating rewards with a preference ranking scale analogous to the famous Elo ranking system developed for chess (Elo, 1978). Just as the difference in Elo points of two chess players estimates the probability of one player defeating the other in a game of chess, the difference in predicted reward of two trajectory segments estimates the probability that one is chosen over the other by the human.
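A minimal numpy sketch of this preference predictor and its cross-entropy loss is given below, assuming each clip has already been scored by summing $\hat{r}$ over its steps; the function names are our own.

```python
import numpy as np


# A minimal sketch (not the paper's code) of Equation 1 and the cross-entropy
# loss, operating on the summed predicted reward of each clip.
def preference_probability(r_hat_sum_1: float, r_hat_sum_2: float) -> float:
    """Predicted P[sigma_1 > sigma_2], a numerically stable two-way softmax."""
    m = max(r_hat_sum_1, r_hat_sum_2)
    e1, e2 = np.exp(r_hat_sum_1 - m), np.exp(r_hat_sum_2 - m)
    return e1 / (e1 + e2)


def preference_loss(summed_rewards, labels) -> float:
    """Cross-entropy between predicted preferences and the human labels mu.

    summed_rewards: list of (sum of r_hat over clip 1, sum of r_hat over clip 2)
    labels:         list of mu = (mu_1, mu_2) distributions over {1, 2}
    """
    total = 0.0
    for (s1, s2), (mu1, mu2) in zip(summed_rewards, labels):
        p1 = preference_probability(s1, s2)
        total -= mu1 * np.log(p1 + 1e-12) + mu2 * np.log(1.0 - p1 + 1e-12)
    return total
```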
Our actual algorithm incorporates a number of modifications to this basic approach, which early experiments discovered to be helpful and which are analyzed in Section 3.3:
We fit an ensemble of predictors, each trained on $|\mathcal{D}|$ triples sampled from $\mathcal{D}$ with replacement. The estimate $\hat{r}$ is defined by independently normalizing each of these predictors and then averaging the results.
A fraction of $1/e$ of the data is held out to be used as a validation set for each predictor. We use $\ell_2$ regularization and adjust the regularization coefficient to keep the validation loss between 1.1 and 1.5 times the training loss. In some domains we also apply dropout for regularization.
Rather than applying a softmax directly as described in Equation 1, we assume there is a 10% chance that the human responds uniformly at random. Conceptually this adjustment is needed because human raters have a constant probability of making an error, which doesn't decay to 0 as the difference in predicted reward becomes extreme.
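The sketch below illustrates two of these modifications (bootstrap resampling for the ensemble and the assumed 10% uniform-response rate); the normalization and the adaptive regularization schedule are simplified relative to our actual training setup.

```python
import numpy as np


# A minimal sketch (not the paper's code) of the ensembling and lapse-rate
# modifications described above.
def bootstrap_datasets(database, num_members, rng=np.random):
    """Each ensemble member trains on |D| triples resampled with replacement."""
    n = len(database)
    return [[database[i] for i in rng.randint(0, n, size=n)]
            for _ in range(num_members)]


def ensemble_reward(member_predictions):
    """Average independently normalized member outputs to form r_hat.

    member_predictions: list of 1-D arrays, one per ensemble member, giving
    that member's predicted rewards on the same batch of (o, a) pairs.
    """
    normalized = [(p - p.mean()) / (p.std() + 1e-8) for p in member_predictions]
    return np.mean(normalized, axis=0)


def lapse_adjusted_probability(p_softmax: float, lapse: float = 0.10) -> float:
    """Mix the softmax prediction with a 10% uniformly random response."""
    return (1.0 - lapse) * p_softmax + lapse * 0.5
```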
2.2.4 Selecting Queries
We decide how to query preferences based on an approximation to the uncertainty in the reward function estimator, similar to Daniel et al. (2014): we sample a large number of pairs of trajectory segments of length $k$, use each reward predictor in our ensemble to predict which segment will be preferred from each pair, and then select those trajectories for which the predictions have the highest variance across ensemble members. This is a crude approximation, and the ablation experiments in Section 3 show that in some tasks it actually impairs performance. Ideally, we would want to query based on the expected value of information of the query (Akrour et al., 2012; Krueger et al., 2016), but we leave it to future work to explore this direction further.
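A sketch of this disagreement-based selection is shown below: candidate pairs are ranked by the variance, across ensemble members, of the predicted probability that the first clip is preferred (the helper name is our own).

```python
import numpy as np


# A minimal sketch (not the paper's code) of selecting which pairs of segments
# to send to the human, based on disagreement among the ensemble members.
def select_queries(candidate_pairs, member_probabilities, num_queries):
    """candidate_pairs:       list of (segment_1, segment_2) tuples
    member_probabilities:     array of shape (num_members, num_pairs) holding
                              each member's predicted P[segment_1 > segment_2]
    Returns the num_queries pairs with the highest cross-member variance."""
    disagreement = np.var(np.asarray(member_probabilities), axis=0)
    ranked = np.argsort(-disagreement)[:num_queries]
    return [candidate_pairs[i] for i in ranked]
```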
3 Experimental Results
We implemented our algorithm in TensorFlow (Abadi et al., 2016). We interface with MuJoCo (Todorov et al., 2012) and the Arcade Learning Environment (Bellemare et al., 2013) through the OpenAI Gym (Brockman et al., 2016).
3.1 Reinforcement Learning Tasks with Unobserved Rewards
In our first set of experiments, we attempt to solve a range of benchmark tasks for deep RL without observing the true reward. Instead, the agent learns about the goal of the task only by asking a human which of two trajectory segments is better. Our goal is to solve the task in a reasonable amount of time using as few queries as possible.
In our experiments, feedback is provided by contractors who are given a 1-2 sentence description of each task before being asked to compare several hundred to several thousand pairs of trajectory segments for that task (see Appendix B for the exact instructions given to contractors). Each trajectory segment is between 1 and 2 seconds long. Contractors responded to the average query in 3-5 seconds, and so the experiments involving real human feedback required between 30 minutes and 5 hours of human time.
For comparison, we also run experiments using a synthetic oracle whose preferences over trajectories exactly reflect reward in the underlying task. That is, when the agent queries for a comparison, instead of sending the query to a human, we immediately reply by indicating a preference for whichever trajectory segment actually receives a higher reward in the underlying task$^4$. We also compare to the baseline of RL training using the real reward. Our aim here is not to outperform but rather to do nearly as well as RL without access to reward information and instead relying on much scarcer feedback. Nevertheless, note that feedback from real humans does have the potential to outperform RL (and as shown below it actually does so on some tasks), because the human feedback might provide a better-shaped reward.
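A minimal sketch of such a synthetic oracle is shown below; it simply prefers whichever segment has the higher total true reward, with ties marked as equally preferable (the tie-handling is our own assumption).

```python
# A minimal sketch (not the paper's code) of the synthetic oracle: it prefers
# whichever trajectory segment has the higher total true task reward.
def synthetic_oracle(true_rewards_1, true_rewards_2):
    """true_rewards_i: the per-step true rewards of segment i.
    Returns mu, the label distribution over {1, 2}."""
    total_1, total_2 = sum(true_rewards_1), sum(true_rewards_2)
    if total_1 > total_2:
        return (1.0, 0.0)
    if total_2 > total_1:
        return (0.0, 1.0)
    return (0.5, 0.5)  # exact ties are marked as equally preferable
```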
We describe the details of our experiments in Appendix A, including model architectures, modifications to the environment, and the RL algorithms used to optimize the policy.
3.1.1 Simulated Robotics
The first tasks we consider are eight simulated robotics tasks, implemented in MuJoCo (Todorov et al., 2012), and included in OpenAI Gym (Brockman et al., 2016). We made small modifications to these tasks in order to avoid encoding information about the task in the environment itself (the modifications are described in detail in Appendix A). The reward functions in these tasks are linear functions of distances, positions and velocities, and all are a quadratic function of the features. We included a simple cartpole task ("pendulum") for comparison, since this is representative of the complexity of tasks studied in prior work.
Figure 2 shows the results of training our agent with 700 queries to a human rater, compared to learning from 350, 700, or 1400 synthetic queries, as well as to RL learning from the real reward.
Figure 2: Results on MuJoCo simulated robotics as measured on the tasks' true reward. We compare our method using real human feedback (purple), our method using synthetic feedback provided by an oracle (shades of blue), and reinforcement learning using the true reward function (orange). All curves are the average of 5 runs, except for the real human feedback, which is a single run, and each point is the average reward over five consecutive batches. For Reacher and Cheetah feedback was provided by an author due to time constraints. For all other tasks, feedback was provided by contractors unfamiliar with the environments and with our algorithm. The irregular progress on Hopper is due to one contractor deviating from the typical labeling schedule.
With 700 labels we are able to nearly match reinforcement learning on all of these tasks. Training with learned reward functions tends to be less stable and higher variance, while having a comparable mean performance.
Surprisingly, by 1400 labels our algorithm performs slightly better than if it had simply been given the true reward, perhaps because the learned reward function is slightly better shaped: the reward learning procedure assigns positive rewards to all behaviors that are typically followed by high reward.
Real human feedback is typically only slightly less effective than the synthetic feedback; depending on the task, human feedback ranged from being half as efficient as ground truth feedback to being equally efficient. On the Ant task the human feedback significantly outperformed the synthetic feedback, apparently because we asked humans to prefer trajectories where the robot was "standing upright," which proved to be useful reward shaping. (There was a similar bonus in the RL reward function to encourage the robot to remain upright, but the simple hand-crafted bonus was not as useful.)
3.1.2 Atari
The second set of tasks we consider is a set of seven Atari games in the Arcade Learning Environment (Bellemare et al., 2013), the same games presented in Mnih et al. (2013).
Figure 3 shows the results of training our agent with 5,500 queries to a human rater, compared to learning from 350, 700, or 1400 synthetic queries, as well as to RL learning from the real reward. Our method has more difficulty matching RL in these challenging environments, but nevertheless it displays substantial learning on most of them and matches or even exceeds RL on some. Specifically, on BeamRider and Pong, synthetic labels match or come close to RL even with only 3,300 such labels. On Seaquest and Qbert synthetic feedback eventually performs near the level of RL but learns more slowly. On SpaceInvaders and Breakout synthetic feedback never matches RL, but nevertheless the agent improves substantially, often passing the first level in SpaceInvaders and reaching a score of 20 on Breakout, or 50 with enough labels.
Figure 3: Results on Atari games as measured on the tasks' true reward. We compare our method using real human feedback (purple), our method using synthetic feedback provided by an oracle (shades of blue), and reinforcement learning using the true reward function (orange). All curves are the average of 3 runs, except for the real human feedback which is a single run, and each point is the average reward over about 150,000 consecutive frames.
Figure 4: Four frames from a single backflip. The agent is trained to perform a sequence of backflips, landing upright each time. The video is available at this link.
On most of the games real human feedback performs similar to or slightly worse than synthetic feedback with the same number of labels, and often comparably to synthetic feedback that has 40% fewer labels. This may be due to human error in labeling, inconsistency between different contractors labeling the same run, or the uneven rate of labeling by contractors, which can cause labels to be overly concentrated in narrow parts of state space. The latter problems could potentially be addressed by future improvements to the pipeline for outsourcing labels. On Qbert, our method fails to learn to beat the first level with real human feedback; this may be because short clips in Qbert can be confusing and difficult to evaluate. Finally, Enduro is difficult for A3C to learn due to the difficulty of successfully passing other cars through random exploration, and is correspondingly difficult to learn with synthetic labels, but human labelers tend to reward any progress towards passing cars, essentially shaping the reward and thus outperforming A3C in this game (the results are comparable to those achieved with DQN).
3.2 Novel behaviors
Experiments with traditional RL tasks help us understand whether our method is effective, but the ultimate purpose of human interaction is to solve tasks for which no reward function is available.
Using the same parameters as in the previous experiments, we show that our algorithm can learn novel complex behaviors. We demonstrate:
1. The Hopper robot performing a sequence of backflips (see Figure 4). This behavior was trained using 900 queries in less than an hour. The agent learns to consistently perform a backflip, land upright, and repeat.
Figure 5: Performance of our algorithm on MuJoCo tasks after removing various components, as described in Section 3.3. All graphs are averaged over 5 runs, using 700 synthetic labels each.
2. The Half-Cheetah robot moving forward while standing on one leg. This behavior was trained using 800 queries in under an hour.
3. Keeping alongside other cars in Enduro. This was trained with roughly 1,300 queries and 4 million frames of interaction with the environment; the agent learns to stay almost exactly even with other moving cars for a substantial fraction of the episode, although it gets confused by changes in background.
Videos of these behaviors can be found at this link. These behaviors were trained using feedback from the authors.
3.3 Ablation Studies
In order to better understand the performance of our algorithm, we consider a range of modifications:
We pick queries uniformly at random rather than prioritizing queries for which there is disagreement (random queries).
We train only one predictor rather than an ensemble (no ensemble). In this setting, we also choose queries at random, since there is no longer an ensemble that we could use to estimate disagreement.
We train on queries only gathered at the beginning of training, rather than gathered throughout training (no online queries).
We remove the $\ell_2$ regularization and use only dropout (no regularization).
On the robotics tasks only, we use trajectory segments of length 1 (no segments).
Rather than fitting $\hat{r}$ using comparisons, we consider an oracle which provides the true total reward over a trajectory segment, and fit $\hat{r}$ to these total rewards using mean squared error (target).
The results are presented in Figure 5 for MuJoCo and Figure 6 for Atari.
Of particular interest is the poor performance of offline reward predictor training; here we find that due to the nonstationarity of the occupancy distribution, the predictor captures only part of the true reward, and maximizing this partial reward can lead to bizarre behavior that is undesirable as measured by the true reward (Amodei et al., 2016). For instance, on Pong offline training sometimes leads our agent to avoid losing points but not to score points; this can result in extremely long volleys
Figure 6: Performance of our algorithm on Atari tasks after removing various components, as described in Section 3.3. All curves are an average of 3 runs using 5,500 synthetic labels (see minor exceptions in Section A.2).
that repeat the same sequence of events ad infinitum (videos at this link). This type of behavior demonstrates that in general human feedback needs to be intertwined with RL learning rather than provided statically.
Our main motivation for eliciting comparisons rather than absolute scores was that we found it much easier for humans to provide consistent comparisons than consistent absolute scores, especially on the continuous control tasks and on the qualitative tasks in Section 3.2. Nevertheless, it seems important to understand how using comparisons affects performance. For continuous control tasks we found that predicting comparisons worked much better than predicting scores. This is likely because the scale of rewards varies substantially, which complicates the regression problem; the problem is smoothed significantly when we only need to predict comparisons. In the Atari tasks we clipped rewards and effectively only predicted the sign, avoiding these difficulties (this is not a suitable solution for the continuous control tasks because the relative magnitudes of the rewards are important to learning). In these tasks comparisons and targets had significantly different performance, but neither consistently outperformed the other.
We also observed large performance differences when using single frames rather than clips$^5$. In order to obtain the same results using single frames we would need to have collected significantly more comparisons. In general we discovered that asking humans to compare longer clips was significantly more helpful per clip, and significantly less helpful per frame. We found that for short clips it took human raters a while just to understand the situation, while for longer clips the evaluation time was a roughly linear function of the clip length. We tried to choose the shortest clip length for which the evaluation time was linear. In the Atari environments we also found that it was often easier to compare longer clips because they provide more context than single frames.
4 Discussion and Conclusions
Agent-environment interactions are often radically cheaper than human interaction. We show that by learning a separate reward model using supervised learning, it is possible to reduce the interaction complexity by roughly 3 orders of magnitude. Not only does this show that we can meaningfully train deep RL agents from human preferences, but also that we are already hitting diminishing returns on further sample-complexity improvements, because the cost of compute is already comparable to the cost of non-expert feedback$^6$.
Although there is a large literature on preference elicitation and reinforcement learning from unknown reward functions, we provide the first evidence that these techniques can be economically scaled up to state-of-the-art reinforcement learning systems. This represents a step towards practical applications of deep RL to complex real-world tasks.
Future work may be able to improve the efficiency of learning from human preferences, and expand the range of tasks to which it can be applied.
In the long run it would be desirable to make learning a task from human preferences no more difficult than learning it from a programmatic reward signal, ensuring that powerful RL systems can be applied in the service of complex human values rather than low-complexity goals.
Acknowledgments
We thank Olivier Pietquin, Bilal Piot, Laurent Orseau, Pedro Ortega, Victoria Krakovna, Owain Evans, Andrej Karpathy, Igor Mordatch, and Jack Clark for reading drafts of the paper. We thank Tyler Adkisson, Mandy Beri, Jessica Richards, Heather Tran, and other contractors for providing the data used to train our agents. Finally, we thank OpenAI and DeepMind for providing a supportive research environment and for supporting and encouraging this collaboration.
References
Martin Abadi et al. TensorFlow: Large-scale machine learning on heterogeneous distributed systems. arXiv preprint arXiv:1603.04467, 2016.
Riad Akrour, Marc Schoenauer, and Michèle Sebag. Preference-based policy learning. Machine Learning and Knowledge Discovery in Databases, pages 12-27, 2011.
Riad Akrour, Marc Schoenauer, and Michèle Sebag. April: Active preference learning-based reinforcement learning. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 116-131, 2012.
Riad Akrour, Marc Schoenauer, Michèle Sebag, and Jean-Christophe Souplet. Programming by feedback. In International Conference on Machine Learning, pages 1503-1511, 2014.
Dario Amodei, Chris Olah, Jacob Steinhardt, Paul Christiano, John Schulman, and Dan Mané. Concrete problems in AI safety. arXiv preprint arXiv:1606.06565, 2016.
Marc G Bellemare, Yavar Naddaf, Joel Veness, and Michael Bowling. The Arcade Learning Environment: An evaluation platform for general agents. Journal of Artificial Intelligence Research, 47:253-279, 2013.
Nick Bostrom. Superintelligence: Paths, Dangers, Strategies. Oxford University Press, 2014.
Ralph Allan Bradley and Milton E Terry. Rank analysis of incomplete block designs: I. The method of paired comparisons. Biometrika, 39(3/4):324-345, 1952.
Eric Brochu, Tyson Brochu, and Nando de Freitas. A Bayesian interactive optimization approach to procedural animation design. In Proceedings of the 2010 ACM SIGGRAPH/Eurographics Symposium on Computer Animation, pages 103-112. Eurographics Association, 2010.
Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba. OpenAI Gym. arXiv preprint arXiv:1606.01540, 2016.
Christian Daniel, Malte Viering, Jan Metz, Oliver Kroemer, and Jan Peters. Active reward learning. In Robotics: Science and Systems, 2014.
Christian Daniel, Oliver Kroemer, Malte Viering, Jan Metz, and Jan Peters. Active reward learning with a novel acquisition function. Autonomous Robots, 39(3):389-405, 2015.
Layla El Asri, Bilal Piot, Matthieu Geist, Romain Laroche, and Olivier Pietquin. Score-based inverse reinforcement learning. In International Conference on Autonomous Agents and Multiagent Systems, pages 457-465, 2016.
Arpad Elo. The Rating of Chessplayers, Past and Present. Arco Pub., 1978.
Chelsea Finn, Sergey Levine, and Pieter Abbeel. Guided cost learning: Deep inverse optimal control via policy optimization. In International Conference on Machine Learning, volume 48, 2016.
Chelsea Finn, Tianhe Yu, Justin Fu, Pieter Abbeel, and Sergey Levine. Generalizing skills with semi-supervised reinforcement learning. In International Conference on Learning Representations, 2017.
Johannes Fürnkranz, Eyke Hüllermeier, Weiwei Cheng, and Sang-Hyeun Park. Preference-based reinforcement learning: A formal framework and a policy iteration algorithm. Machine Learning, 89(1-2):123-156, 2012.
Dylan Hadfield-Menell, Stuart Russell, Pieter Abbeel, and Anca Dragan. Cooperative inverse reinforcement learning. In Advances in Neural Information Processing Systems, pages 3909-3917, 2016.
Todd Hester, Matej Vecerik, Olivier Pietquin, Marc Lanctot, Tom Schaul, Bilal Piot, Andrew Sendonaris, Gabriel Dulac-Arnold, Ian Osband, John Agapiou, Joel Z Leibo, and Audrunas Gruslys. Learning from demonstrations for real world reinforcement learning. arXiv preprint arXiv:1704.03732, 2017.
Jonathan Ho and Stefano Ermon. Generative adversarial imitation learning. In Advances in Neural Information Processing Systems, pages 4565-4573, 2016.
W Bradley Knox and Peter Stone. Interactively shaping agents via human reinforcement: The TAMER framework. In International Conference on Knowledge Capture, pages 9-16, 2009.
W Bradley Knox and Peter Stone. Learning non-myopically from human-generated reward. In Jihie Kim, Jeffrey Nichols, and Pedro A Szekely, editors, IUI, pages 191-202. ACM, 2013. ISBN 978-1-4503-1965-2. URL http://doi.acm.org/10.1145/2449396.
William Bradley Knox. Learning from human-generated reward. PhD thesis, University of Texas at Austin, 2012.
David Krueger, Jan Leike, Owain Evans, and John Salvatier. Active reinforcement learning: Observing rewards at a cost. In Future of Interactive Learning Machines, NIPS Workshop, 2016.
R Duncan Luce. Individual Choice Behavior: A Theoretical Analysis. Courier Corporation, 2005.
James MacGlashan, Mark K Ho, Robert Loftin, Bei Peng, David Roberts, Matthew E Taylor, and Michael L Littman. Interactive learning from policy-dependent human feedback. arXiv preprint arXiv:1701.06049, 2017.
AT Machwe and IC Parmee. Introducing machine learning within an interactive evolutionary design environment. In DS 36: Proceedings of DESIGN 2006, the 9th International Design Conference, Dubrovnik, Croatia, 2006.
Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller. Playing Atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602, 2013.
Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, Stig Petersen, Charles Beattie, Amir Sadik, Ioannis Antonoglou, Helen King, Dharshan Kumaran, Daan Wierstra, Shane Legg, and Demis Hassabis. Human-level control through deep reinforcement learning. Nature, 518(7540):529-533, 2015.
Volodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza, Alex Graves, Timothy Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. Asynchronous methods for deep reinforcement learning. In International Conference on Machine Learning, pages 1928-1937, 2016.
Andrew Y Ng and Stuart Russell. Algorithms for inverse reinforcement learning. In International Conference on Machine Learning, pages 663-670, 2000.
Patrick M Pilarski, Michael R Dawson, Thomas Degris, Farbod Fahimi, Jason P Carey, and Richard Sutton. Online human training of a myoelectric prosthesis controller via actor-critic reinforcement learning. In International Conference on Rehabilitation Robotics, pages 1-7, 2011.
Stuart Russell. Should we fear supersmart robots? Scientific American, 314(6):58, 2016.
John Schulman, Sergey Levine, Pieter Abbeel, Michael I Jordan, and Philipp Moritz. Trust region policy optimization. In International Conference on Machine Learning, pages 1889-1897, 2015.
Jimmy Secretan, Nicholas Beato, David B D'Ambrosio, Adelein Rodriguez, Adam Campbell, and Kenneth O Stanley. Picbreeder: Evolving pictures collaboratively online. In Conference on Human Factors in Computing Systems, pages 1759-1768, 2008.
Roger N Shepard. Stimulus and response generalization: A stochastic model relating generalization to distance in psychological space. Psychometrika, 22(4):325-345, 1957.
David Silver, Aja Huang, Chris J Maddison, Arthur Guez, Laurent Sifre, George Van Den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, Sander Dieleman, Dominik Grewe, John Nham, Nal Kalchbrenner, Ilya Sutskever, Timothy Lillicrap, Madeleine Leach, Koray Kavukcuoglu, Thore Graepel, and Demis Hassabis. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484-489, 2016.
Patrikk D Sørensen, Jeppeh M Olsen, and Sebastian Risi. Breeding a diversity of Super Mario behaviors through interactive evolution. In Computational Intelligence and Games (CIG), 2016 IEEE Conference on, pages 1-7. IEEE, 2016.
Bradly C Stadie, Pieter Abbeel, and Ilya Sutskever. Third-person imitation learning. In International Conference on Learning Representations, 2017.
Hiroaki Sugiyama, Toyomi Meguro, and Yasuhiro Minami. Preference-learning based inverse reinforcement learning for dialog control. In INTERSPEECH, pages 222-225, 2012.
Emanuel Todorov, Tom Erez, and Yuval Tassa. MuJoCo: A physics engine for model-based control. In International Conference on Intelligent Robots and Systems, pages 5026-5033, 2012.
Sida I Wang, Percy Liang, and Christopher D Manning. Learning language games through interaction. arXiv preprint arXiv:1606.02447, 2016.
Aaron Wilson, Alan Fern, and Prasad Tadepalli. A Bayesian approach for policy learning from trajectory preference queries. In Advances in Neural Information Processing Systems, pages 1133-1141, 2012.
Christian Wirth and Johannes Fürnkranz. Preference-based reinforcement learning: A preliminary survey. In ECML/PKDD Workshop on Reinforcement Learning from Generalized Feedback: Beyond Numeric Rewards, 2013.
Christian Wirth, Johannes Fürnkranz, Gerhard Neumann, et al. Model-free preference-based reinforcement learning. In AAAI, pages 2222-2228, 2016.
A Experimental Details
Many RL environments have termination conditions that depend on the behavior of the agent, such as ending an episode when the agent dies or falls over. We found that such termination conditions encode information about the task even when the reward function is not observable. To avoid this subtle source of supervision, which could potentially confound our attempts to learn from human preferences only, we removed all variable-length episodes:
In the Gym versions of our robotics tasks, the episode ends when certain parameters go outside of a prescribed range (for example when the robot falls over). We replaced these termination conditions by a penalty which encourages the parameters to remain in the range (and which the agent must learn).
In Atari games, we do not send life loss or episode end signals to the agent (we do continue to actually reset the environment), effectively converting the environment into a single continuous episode. When providing synthetic oracle feedback we replace episode ends with a penalty in all games except Pong; the agent must learn this penalty.
Removing variable-length episodes leaves the agent with only the information encoded in the environment itself; human feedback provides its only guidance about what it ought to do.
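To make this episode-length change concrete, the sketch below shows one way a Gym-style wrapper could suppress behavior-dependent termination and substitute a fixed penalty; the wrapper name, penalty value, and reset handling are illustrative assumptions rather than the exact implementation used here.

```python
# A minimal sketch (assumptions: wrapper name, penalty value, classic 4-tuple
# Gym step API) of suppressing behavior-dependent termination so that episode
# length no longer leaks task information.
import gym

class NoEarlyTerminationWrapper(gym.Wrapper):
    def __init__(self, env, penalty=-1.0):
        super().__init__(env)
        self.penalty = penalty  # illustrative value; the agent must learn it

    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        if done and not info.get("TimeLimit.truncated", False):
            # The underlying environment wanted to end the episode (e.g. the
            # robot fell over). Replace that signal with a penalty, reset the
            # environment behind the scenes, and keep the episode going.
            reward += self.penalty
            obs = self.env.reset()
            done = False
        return obs, reward, done, info
```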
At the beginning of training we compare a number of trajectory segments drawn from rollouts of an untrained (randomly initialized) policy. In the Atari domain we also pretrain the reward predictor for 200 epochs before beginning RL training, to reduce the likelihood of irreversibly learning a bad policy based on an untrained predictor. For the rest of training, labels are fed in at a rate decaying inversely with the number of timesteps; after twice as many timesteps have elapsed, we answer about half as many queries per unit time. The details of this schedule are described in each section. This "label annealing" allows us to balance the importance of having a good predictor from the start with the need to adapt the predictor as the RL agent learns and encounters new states. When training with real human feedback, we attempt to similarly anneal the label rate, although in practice this is approximate because contractors give feedback at uneven rates.
Except where otherwise stated we use an ensemble of 3 predictors, and draw 10 times as many clip pair candidates as we ultimately present to the human, with the presented clips selected by maximizing the variance between the different predictors, as described in Section 2.2.4.
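As a rough illustration of this variance-based selection, the sketch below oversamples candidate clip pairs, scores each pair with every ensemble member, and keeps the pairs with the highest disagreement; the function name and the callable interface for the predictors are assumptions, not the paper's code.

```python
# Illustrative sketch: pick the preference queries on which the ensemble of
# reward predictors disagrees most. Each element of `ensemble` is assumed to
# be a callable mapping a (left_clip, right_clip) pair to the predicted
# probability that the left clip is preferred.
import numpy as np

def select_queries(candidate_pairs, ensemble, num_queries):
    variances = []
    for left, right in candidate_pairs:
        probs = np.array([predict(left, right) for predict in ensemble])
        variances.append(probs.var())  # disagreement across ensemble members
    top = np.argsort(variances)[-num_queries:]
    return [candidate_pairs[i] for i in top]
```

In this setup, roughly 10 times as many candidate pairs would be passed in as the number of queries ultimately shown to the human.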
A.1 Simulated Robotics Tasks
The OpenAI Gym continuous control tasks penalize large torques. Because torques are not directly visible to a human supervisor, these reward functions are not good representatives of human preferences over trajectories, and so we removed them.
For the simulated robotics tasks, we optimize policies using trust region policy optimization (TRPO, Schulman et al. 2015) with discount rate $\gamma=0.995$ and $\lambda=0.97$. The reward predictor is a two-layer neural network with 64 hidden units each, using leaky ReLUs ($\alpha=0.01$) as nonlinearities.$^{7}$ We compare trajectory segments that last 1.5 seconds, which corresponds to 15 to 60 timesteps depending on the task.
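For concreteness, a sketch of a reward predictor with this shape (two hidden layers of 64 units, leaky ReLU with $\alpha=0.01$, scalar output) follows; the framework (PyTorch), class name, and the choice to feed in concatenated observation/action features are our assumptions.

```python
# Sketch of a two-layer reward predictor for the continuous control tasks:
# 64 hidden units per layer, leaky ReLU (alpha = 0.01), scalar output per
# timestep. Framework and naming are illustrative assumptions.
import torch.nn as nn

class MujocoRewardPredictor(nn.Module):
    def __init__(self, input_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(input_dim, 64),
            nn.LeakyReLU(0.01),
            nn.Linear(64, 64),
            nn.LeakyReLU(0.01),
            nn.Linear(64, 1),  # scalar reward estimate
        )

    def forward(self, obs_action):
        # obs_action: (batch, input_dim) concatenated observation/action features
        return self.net(obs_action).squeeze(-1)
```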
We normalize the reward predictions to have standard deviation 1. When learning from the reward predictor, we add an entropy bonus of 0.01 on all tasks except swimmer, where we use an entropy bonus of 0.001. As noted in Section 2.2.1, this entropy bonus helps to incentivize the increased exploration needed to deal with a changing reward function.
We collect $25\%$ of our comparisons from a randomly initialized policy network at the beginning of training, and our rate of labeling after $T$ frames is $2 \times 10^{6} /(T+2 \times 10^{6})$.
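The annealing schedule above can be written as a small helper; the function name is ours, and the constant would be $5 \times 10^{6}$ for the Atari experiments in Section A.2.

```python
# Sketch of the label-annealing schedule: the relative rate of preference
# queries after T frames decays as C / (T + C). C = 2e6 matches the robotics
# schedule above; the Atari experiments use C = 5e6 (Section A.2).
def label_rate(frames_elapsed, scale=2e6):
    return scale / (frames_elapsed + scale)

# e.g. label_rate(0) == 1.0, label_rate(2e6) == 0.5, label_rate(6e6) == 0.25
```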
A.2 Atari
Our Atari agents are trained using the standard set of environment wrappers used by Mnih et al. (2015): 0 to 30 no-ops at the beginning of an episode, max-pooling over adjacent frames, stacking of 4 frames, a frameskip of 4, life loss ending an episode (but not resetting the environment), and rewards clipped to $[-1,1]$.
Atari games include a visual display of the score, which in theory could be used to trivially infer the reward. Since we want to focus instead on inferring the reward from the complex dynamics happening in the game, we replace the score area with a constant black background on all seven games. On BeamRider we additionally blank out the enemy ship count, and on Enduro we blank out the speedometer.
For the Atari tasks we optimize policies using the A3C algorithm (Mnih et al. 2016) in synchronous form (A2C), with the policy architecture described in Mnih et al. (2015). We use standard settings for the hyperparameters: an entropy bonus of $\beta=0.01$, a learning rate of 0.0007 decayed linearly to reach zero after 80 million timesteps (although runs were actually trained for only 50 million timesteps), $n=5$ steps per update, $N=16$ parallel workers, discount rate $\gamma=0.99$, and policy gradient using Adam with $\alpha=0.99$ and $\epsilon=10^{-5}$.
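For reference, the A2C hyperparameters listed above are collected into a single config sketch below; the key names are illustrative and not tied to any particular library.

```python
# The A2C hyperparameters quoted above, gathered into one place. Key names are
# our own; they do not correspond to a specific library's argument names.
a2c_hyperparams = dict(
    entropy_bonus=0.01,          # beta
    learning_rate=7e-4,          # decayed linearly to zero over 80M timesteps
    steps_per_update=5,          # n
    parallel_workers=16,         # N
    discount=0.99,               # gamma
    optimizer_alpha=0.99,
    optimizer_epsilon=1e-5,
)
```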
For the reward predictor, we use $84 \times 84$ images as inputs (the same as the inputs to the policy), and stack 4 frames for a total $84 \times 84 \times 4$ input tensor. This input is fed through 4 convolutional layers of size $7 \times 7$, $5 \times 5$, $3 \times 3$, and $3 \times 3$ with strides 3, 2, 1, 1, each having 16 filters, with leaky ReLU nonlinearities ($\alpha=0.01$). This is followed by a fully connected layer of size 64 and then a scalar output. All convolutional layers use batch norm and dropout with $\alpha=0.5$ to prevent predictor overfitting. In addition we use $\ell_{2}$ regularization with the adaptive scheme described in Section 2.2.3. Since the reward predictor is ultimately used to compare two sums over timesteps, its scale is arbitrary, and we normalize it to have a standard deviation of 0.05 (we could equivalently have adjusted our learning rates and entropy bonus, but this choice allowed us to use the same parameters as for the real reward function).
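A sketch of a network with this architecture follows (four 16-filter convolutions of sizes $7 \times 7$, $5 \times 5$, $3 \times 3$, $3 \times 3$ with strides 3, 2, 1, 1, batch norm, dropout, and leaky ReLU, then a 64-unit fully connected layer and a scalar output); the framework (PyTorch), the layer ordering within each block, and the lazily sized linear layer are our assumptions.

```python
# Sketch of the Atari reward predictor described above. Input is a stack of
# 4 frames of size 84x84; layer ordering inside each conv block and the lazily
# sized linear layer are implementation assumptions.
import torch.nn as nn

class AtariRewardPredictor(nn.Module):
    def __init__(self):
        super().__init__()
        blocks, in_ch = [], 4
        for kernel, stride in [(7, 3), (5, 2), (3, 1), (3, 1)]:
            blocks += [
                nn.Conv2d(in_ch, 16, kernel, stride),
                nn.BatchNorm2d(16),
                nn.Dropout2d(0.5),
                nn.LeakyReLU(0.01),
            ]
            in_ch = 16
        self.convs = nn.Sequential(*blocks)
        self.head = nn.Sequential(nn.LazyLinear(64), nn.LeakyReLU(0.01), nn.Linear(64, 1))

    def forward(self, frames):
        # frames: (batch, 4, 84, 84) stacked observations
        x = self.convs(frames)
        return self.head(x.flatten(start_dim=1)).squeeze(-1)
```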
We compare trajectory segments of 25 timesteps (1.7 seconds at 15 fps with frame skipping).
We collect 500 comparisons from a randomly initialized policy network at the beginning of training, and our rate of labeling after $T$ frames of training is decreased every $5 \times 10^{6}$ frames, to be roughly proportional to $5 \times 10^{6} /(T+5 \times 10^{6})$.
The predictor is trained asynchronously from the RL agent, and on our hardware typically processes 1 label per 10 RL timesteps. We maintain a buffer of only the last 3,000 labels and loop over this buffer continuously; this is to ensure that the predictor gives enough weight to new labels (which can represent a shift in distribution) when the total number of labels becomes large.
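A minimal sketch of such a label buffer is shown below; the batching scheme (uniform sampling while cycling over the buffer) is an assumption about one reasonable way to loop over the buffer continuously.

```python
# Sketch of the fixed-size label buffer: only the most recent 3,000 labeled
# comparisons are retained, so new labels keep a meaningful weight even when
# many labels have been collected. The sampling scheme is an assumption.
import random
from collections import deque

label_buffer = deque(maxlen=3000)

def add_label(clip_pair, preference):
    label_buffer.append((clip_pair, preference))

def next_training_batch(batch_size=64):
    return random.sample(list(label_buffer), min(batch_size, len(label_buffer)))
```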
In the ablation studies of Figure 5b, pretraining has 5,000 labels rather than 5,500, and the "target" BeamRider curve is averaged over 2 runs rather than 3.
B Instructions Provided to Contractors
B.1 MuJoCo
Giving feedback
Sign up for a slot in the spreadsheet. Then go to the appropriate URLs that we give you, and you'll be repeatedly presented with two video clips of an AI controlling a virtual robot.
Look at the clips and select the one in which better things happen. Only decide on events you actually witness in the clip.
Here's a guide on what constitutes good and bad behavior in each specific domain:
Hopper: the "center" of the robot is the joint closest to the pointy end. The first priority is for the center of the robot to move to the right (moving to the left is worse than not moving at all). If the two robots are roughly tied on this metric, then the tiebreaker is how high the center is.
Walker: the "center" of the robot is the joint where the three limbs meet. The first priority is for the center of the robot to move to the right. If the two robots are roughly tied on this metric, then the tiebreaker is how high the center is.
Swimmer: the "center" of the robot is the mark in the middle of its body. The center should move to the right as fast as possible.
Cheetah: the robot should move to the right as fast as possible.
Ant: the first priority is for the robot to be standing upright, and failing that for the center of the robot to be as high up as possible. If both robots are upright or neither is, the tiebreaker is whichever one is moving faster to the right.
Reacher: the green dot on the robot arm should be as close as possible to the red dot. Being near for a while and far for a while is worse than being at an intermediate distance for the entire clip.
Pendulum: the pendulum should be pointing approximately up. There will be a lot of ties where the pendulum has fallen and a lot of "can't tells" where it is off the side of the screen. If you can see one pendulum and it hasn't fallen down, that's better than being unable to see the other pendulum.
Double-pendulum: both pendulums should be pointing approximately up (if they fall down, the cart should try to swing them back up) and the cart should be near the center of the track. Being high for a while and low for a while is worse than being at an intermediate distance the entire time.
If both clips look about the same to you, then click "tie". If you don't understand what's going on in the clip or find it hard to evaluate, then click "can't tell".
You can speed up your feedback by using the arrow keys: left and right select clips, up is a tie, down is "can't tell".
FAQ
I got an error saying that we're out of clips. What's up? Occasionally the server may run out of clips to give you, and you'll see an error message. This is normal; just wait a minute and refresh the page. If you don't get clips for more than a couple of minutes, please ping @tom on Slack.
Do I need to start right at the time listed in the spreadsheet? Starting 10 minutes before or after the listed time is fine.
B.2 Atari
In this task you'll be trying to teach an AI to play Atari games by giving it feedback on how well it is playing.
IMPORTANT. First play the game yourself for 5 minutes
Before providing feedback to the AI, play the game yourself for five minutes to get a sense of how it works. It's often hard to tell what the game is about just by looking at short clips, especially if you've never played it before.
Play the game online for 5 minutes.$^{8}$ You'll need to press F12 or click the GAME RESET button to start the game. Then set a timer for 5 minutes and explore the game to see how it works.
Giving feedback
Sign up for a slot in the spreadsheet. Then go to the appropriate URLs that we give you, and you'll be repeatedly presented with two video clips of an AI playing the game.
Look at the clips and select the one in which better things happen. For example, if the left clip shows the AI shooting an enemy ship while the right clip shows it being shot by an enemy ship, then better things happen in the left clip and thus the left clip is better. Only decide on actions you actually witness in the clip.
Here's a guide on what constitutes good and bad play in each specific game:
BeamRider: shoot enemy ships (good), and don't get shot (very bad)
Breakout: hit the ball with the paddle, break the colored blocks, and don't let the ball fall off the bottom of the screen
Enduro: pass as many cars as you can, and don't get passed by cars
Pong: knock the ball past the opponent's orange paddle on the left (good), and don't let it go past your green paddle on the right (bad)
Qbert: change the color of as many blocks as you can (good), but don't jump off the side or run into enemies (very bad)
SpaceInvaders: shoot enemy ships (good), and don't let your ship (the one at the bottom of the screen) get shot (very bad)
SeaQuest: shoot the fish and enemy submarines (good) and pick up the scuba divers. Don't let your submarine run out of air or get hit by a fish or torpedo (very bad)
Enduro (even mode): avoid passing cars OR getting passed by them; you want to stay even with other cars (not having any around is OK too)
Don't worry about how the agent got into the situation it is in (for instance, it doesn't matter if one agent has more lives, or is now on a more advanced level); just focus on what happens in the clip itself.
If both clips look about the same to you, then click "tie". If you don't understand what's going on in the clip or find it hard to evaluate, then click "can't tell". Try to minimize responding "can't tell" unless you truly are confused.
You can speed up your feedback by using the arrow keys: left and right select clips, up is a tie, down is "can't tell".
FAQ
I got an error saying that we're out of clips. What's up? Occasionally the server may run out of clips to give you, and you'll see an error message. This is normal; just wait a minute and refresh the page. If you don't get clips for more than a couple of minutes, please ping @tom on Slack.
If the agent is already dead when the clip starts, how should I compare it? If the clip is after getting killed (but not showing the dying), then its performance during the clip is neither good nor bad. You can treat it as purely average play. If you see it die, or it's possible that it contains a frame of it dying, then it's definitely bad.
Do I need to start right at the time listed in the spreadsheet? Starting 30 minutes before or after the listed time is fine.
$^{1}$ Here we assume that the reward is a function of the observation and action. In our experiments in Atari environments, we instead assume the reward is a function of the preceding 4 observations. In a general partially observable environment, we could instead consider reward functions that depend on the whole sequence of observations, and model this reward function with a recurrent neural network.
$^{2}$ Wilson et al. (2012) also assumes the ability to sample reasonable initial states. But we work with high-dimensional state spaces for which random states will not be reachable and in which the intended policy inhabits a low-dimensional manifold.
$^{3}$ Equation 1 does not use discounting, which could be interpreted as modeling the human to be indifferent about when things happen in the trajectory segment. Using explicit discounting or inferring the human's discount function would also be reasonable choices.
$^{4}$ In the case of Atari games with sparse rewards, it is relatively common for two clips to both have zero reward, in which case the oracle outputs indifference. Because we considered clips rather than individual states, such ties never made up a large majority of our data. Moreover, ties still provide significant information to the reward predictor as long as they are not too common.
$^{5}$ We only ran these tests on continuous control tasks because our Atari reward model depends on a sequence of consecutive frames rather than a single frame, as described in Section A.2.
$^{6}$ For the Atari experiments we are using a virtual machine with 16 CPUs and one Nvidia K80 GPU, which costs $\sim\$700$/month on GCE. Training takes about a day, so the compute cost is $\sim\$25$. Training with 5k labels corresponds roughly to 5 hours of human labour; at US minimum wage this totals $\sim\$36$.
$^{7}$ All of these reward functions are second-degree polynomials of the input features, and so if we were concerned only with these tasks we could take a simpler approach to learning the reward function. However, using this more flexible architecture allows us to immediately generalize to tasks for which the reward function is not so simple, as described in Section 3.2.