For sophisticated reinforcement learning (RL) systems to interact usefully with real-world environments, we need to communicate complex goals to these systems. In this work, we explore goals defined in terms of (non-expert) human preferences between pairs of trajectory segments. We show that this approach can effectively solve complex RL tasks without access to the reward function, including Atari games and simulated robot locomotion, while providing feedback on less than 1% of our agent's interactions with the environment. This reduces the cost of human oversight far enough that it can be practically applied to state-of-the-art RL systems. To demonstrate the flexibility of our approach, we show that we can successfully train complex novel behaviors with about an hour of human time. These behaviors and environments are considerably more complex than any which have been previously learned from human feedback.
1 Introduction
Recent success in scaling reinforcement learning (RL) to large problems has been driven in domains that have a well-specified reward function (Mnih et al., 2015, 2016; Silver et al., 2016). Unfortunately, many tasks involve goals that are complex, poorly-defined, or hard to specify. Overcoming this limitation would greatly expand the possible impact of deep RL and could increase the reach of machine learning more broadly.
For example, suppose that we wanted to use reinforcement learning to train a robot to clean a table or scramble an egg. It's not clear how to construct a suitable reward function, which will need to be a function of the robot's sensors. We could try to design a simple reward function that approximately captures the intended behavior, but this will often result in behavior that optimizes our reward function without actually satisfying our preferences. This difficulty underlies recent concerns about misalignment between our values and the objectives of our RL systems (Bostrom, 2014; Russell, 2016; Amodei et al., 2016). If we could successfully communicate our actual objectives to our agents, it would be a significant step towards addressing these concerns.
If we have demonstrations of the desired task, we can extract a reward function using inverse reinforcement learning (Ng and Russell, 2000). This reward function can then be used to train an agent with reinforcement learning. More directly, we can use imitation learning to clone the demonstrated behavior. However, these approaches are not directly applicable to behaviors that are difficult for humans to demonstrate (such as controlling a robot with many degrees of freedom but very non-human morphology).
An alternative approach is to allow a human to provide feedback on our system's current behavior and to use this feedback to define the task. In principle this fits within the paradigm of reinforcement learning, but using human feedback directly as a reward function is prohibitively expensive for RL systems that require hundreds or thousands of hours of experience. In order to practically train deep RL systems with human feedback, we need to decrease the amount of feedback required by several orders of magnitude.
Our approach is to learn a reward function from human feedback and then to optimize that reward function. This basic approach has been considered previously, but we confront the challenges involved in scaling it up to modern deep RL and demonstrate by far the most complex behaviors yet learned from human feedback.
In summary, we desire a solution to sequential decision problems without a well-specified reward function that
1. enables us to solve tasks for which we can only recognize the desired behavior, but not necessarily demonstrate it,
2. allows agents to be taught by non-expert users,
3. scales to large problems, and
4. is economical with user feedback.
Our algorithm fits a reward function to the human's preferences while simultaneously training a policy to optimize the current predicted reward function (see Figure 1). We ask the human to compare short video clips of the agent's behavior, rather than to supply an absolute numerical score. We found comparisons to be easier for humans to provide in some domains, while being equally useful for learning human preferences. Comparing short video clips is nearly as fast as comparing individual states, but we show that the resulting comparisons are significantly more helpful. Moreover, we show that collecting feedback online improves the system's performance and prevents it from exploiting weaknesses of the learned reward function.
Figure 1: Schematic illustration of our approach: the reward predictor is trained asynchronously from comparisons of trajectory segments, and the agent maximizes predicted reward.
Our experiments take place in two domains: Atari games in the Arcade Learning Environment (Bellemare et al., 2013), and robotics tasks in the physics simulator MuJoCo (Todorov et al., 2012). We show that a small amount of feedback from a non-expert human, ranging from fifteen minutes to five hours, suffices to learn most of the original RL tasks even when the reward function is not observable. We then consider some novel behaviors in each domain, such as performing a backflip or driving with the flow of traffic. We show that our algorithm can learn these behaviors from about an hour of feedback, even though it is unclear how to hand-engineer a reward function that would incentivize them.
1.1 Related Work
A long line of work studies reinforcement learning from human ratings or rankings, including Akrour et al. (2011), Pilarski et al. (2011), Akrour et al. (2012), Wilson et al. (2012), Sugiyama et al. (2012), Wirth and Fürnkranz (2013), Daniel et al. (2015), El Asri et al. (2016), Wang et al. (2016), and Wirth et al. (2016). Other lines of research consider the general problem of reinforcement learning from preferences rather than absolute reward values (Fürnkranz et al., 2012; Akrour et al., 2014), and optimizing using human preferences in settings other than reinforcement learning (Machwe and Parmee, 2006; Secretan et al., 2008; Brochu et al., 2010; Sørensen et al., 2016).
Our algorithm follows the same basic approach as Akrour et al. (2012) and Akrour et al. (2014). They consider continuous domains with four degrees of freedom and small discrete domains, where they can assume that the reward is linear in the expectations of hand-coded features. We instead consider physics tasks with dozens of degrees of freedom and Atari tasks with no hand-engineered features; the complexity of our environments forces us to use different RL algorithms and reward models, and to cope with different algorithmic tradeoffs. One notable difference is that Akrour et al. (2012) and Akrour et al. (2014) elicit preferences over whole trajectories rather than short clips. So although we gather about two orders of magnitude more comparisons, our experiments require less than one order of magnitude more human time. Other differences focus on changing our training procedure to cope with nonlinear reward models and modern deep RL, for example using asynchronous training and ensembling.
Our approach to feedback elicitation closely follows Wilson et al. (2012). However, Wilson et al. (2012) assumes that the reward function is the distance to some unknown "target" policy (which is itself a linear function of hand-coded features). They fit this reward function using Bayesian inference, and rather than performing RL they produce trajectories using the MAP estimate of the target policy. Their experiments involve "synthetic" human feedback which is drawn from their Bayesian model, while we perform experiments with feedback gathered from non-expert users. It is not clear if the methods in Wilson et al. (2012) can be extended to complex tasks or if they can work with real human feedback.
MacGlashan et al. (2017), Pilarski et al. (2011), Knox and Stone (2009), and Knox (2012) perform experiments involving reinforcement learning from actual human feedback, although their algorithmic approach is less similar. In MacGlashan et al. (2017) and Pilarski et al. (2011), learning only occurs during episodes where the human trainer provides feedback. This appears to be infeasible in domains like Atari games where thousands of hours of experience are required to learn a high-quality policy, and would be prohibitively expensive even for the simplest tasks we consider. TAMER (Knox, 2012; Knox and Stone, 2013) also learns a reward function; however, they consider much simpler settings where the desired policy can be learned relatively quickly.
Our work could also be seen as a specific instance of the cooperative inverse reinforcement learning framework (Hadfield-Menell et al., 2016). This framework considers a two-player game between a human and a robot interacting with an environment with the purpose of maximizing the human's reward function. In our setting the human is only allowed to interact with this game by stating their preferences.
Compared to all prior work, our key contribution is to scale human feedback up to deep reinforcement learning and to learn much more complex behaviors. This fits into a recent trend of scaling reward learning methods to large deep learning systems, for example inverse RL (Finn et al., 2016), imitation learning (Ho and Ermon, 2016; Stadie et al., 2017), semi-supervised skill generalization (Finn et al., 2017), and bootstrapping RL from demonstrations (Silver et al., 2016; Hester et al., 2017).
2 Preliminaries and Method
2.1 Setting and Goal
We consider an agent interacting with an environment over a sequence of steps; at each time $t$ the agent receives an observation $o_t \in \mathcal{O}$ from the environment and then sends an action $a_t \in \mathcal{A}$ to the environment.
In traditional reinforcement learning, the environment would also supply a reward $r_t \in \mathbb{R}$ and the agent's goal would be to maximize the discounted sum of rewards. Instead of assuming that the environment produces a reward signal, we assume that there is a human overseer who can express preferences between trajectory segments. A trajectory segment is a sequence of observations and actions, $\sigma = \left(\left(o_0, a_0\right), \left(o_1, a_1\right), \ldots, \left(o_{k-1}, a_{k-1}\right)\right) \in (\mathcal{O} \times \mathcal{A})^k$. Write $\sigma^1 \succ \sigma^2$ to indicate that the human preferred trajectory segment $\sigma^1$ to trajectory segment $\sigma^2$. Informally, the goal of the agent is to produce trajectories which are preferred by the human, while making as few queries as possible to the human.
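For concreteness, the objects above can be represented with two small records; a minimal sketch in Python (the class and field names here are our own illustration, not taken from the paper):

```python
from dataclasses import dataclass
from typing import List, Tuple

import numpy as np


@dataclass
class Segment:
    """A trajectory segment: k (observation, action) pairs."""
    observations: List[np.ndarray]  # o_0, ..., o_{k-1}
    actions: List[np.ndarray]       # a_0, ..., a_{k-1}


@dataclass
class Comparison:
    """A human judgment over a pair of segments.

    mu is a distribution over {1, 2}: (1.0, 0.0) if segment 1 was preferred,
    (0.0, 1.0) if segment 2 was preferred, and (0.5, 0.5) if they were judged
    equally good. Incomparable pairs are simply not stored.
    """
    segment_1: Segment
    segment_2: Segment
    mu: Tuple[float, float]
```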
More precisely, we will evaluate our algorithms' behavior in two ways:
Quantitative: We say that preferences $\succ$ are generated by a reward function$^{1}$ $r: \mathcal{O} \times \mathcal{A} \rightarrow \mathbb{R}$ if

$$
\left(\left(o_0^1, a_0^1\right), \ldots, \left(o_{k-1}^1, a_{k-1}^1\right)\right) \succ \left(\left(o_0^2, a_0^2\right), \ldots, \left(o_{k-1}^2, a_{k-1}^2\right)\right)
$$

whenever

$$
r\left(o_0^1, a_0^1\right) + \cdots + r\left(o_{k-1}^1, a_{k-1}^1\right) > r\left(o_0^2, a_0^2\right) + \cdots + r\left(o_{k-1}^2, a_{k-1}^2\right).
$$
If the human's preferences are generated by a reward function $r$, then our agent ought to receive a high total reward according to $r$. So if we know the reward function $r$, we can evaluate the agent quantitatively. Ideally the agent will achieve reward nearly as high as if it had been using RL to optimize $r$.
Qualitative: Sometimes we have no reward function by which we can quantitatively evaluate behavior (this is the situation where our approach would be practically useful). In these cases, all we can do is qualitatively evaluate how well the agent satisfies the human's preferences. In this paper, we will start from a goal expressed in natural language, ask a human to evaluate the agent's behavior based on how well it fulfills that goal, and then present videos of agents attempting to fulfill that goal.
Our model based on trajectory segment comparisons is very similar to the trajectory preference queries used in Wilson et al. (2012), except that we don't assume that we can reset the system to an arbitrary state$^{2}$ and so our segments generally begin from different states. This complicates the interpretation of human comparisons, but we show that our algorithm overcomes this difficulty even when the human raters have no understanding of our algorithm.
2.2 Our Method
At each point in time our method maintains a policy $\pi: \mathcal{O} \rightarrow \mathcal{A}$ and a reward function estimate $\hat{r}: \mathcal{O} \times \mathcal{A} \rightarrow \mathbb{R}$, each parametrized by deep neural networks.
These networks are updated by three processes:
1. The policy $\pi$ interacts with the environment to produce a set of trajectories $\{\tau^1, \ldots, \tau^i\}$. The parameters of $\pi$ are updated by a traditional reinforcement learning algorithm, in order to maximize the sum of the predicted rewards $r_t = \hat{r}(o_t, a_t)$.
2. We select pairs of segments $(\sigma^1, \sigma^2)$ from the trajectories $\{\tau^1, \ldots, \tau^i\}$ produced in step 1, and send them to a human for comparison.
3. The parameters of the mapping $\hat{r}$ are optimized via supervised learning to fit the comparisons collected from the human so far.
These processes run asynchronously, with trajectories flowing from process (1) to process (2), human comparisons flowing from process (2) to process (3), and parameters for $\hat{r}$ flowing from process (3) to process (1). The following subsections provide details on each of these processes.
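A minimal sketch of how these pieces fit together, written as a single synchronous loop for readability (the actual system runs the three processes asynchronously); the callables passed in stand for the components described above and are our own naming:

```python
def training_loop(policy, reward_model, env, num_iterations,
                  run_policy, update_policy, select_pairs,
                  ask_human, fit_reward_model):
    """Synchronous caricature of the three asynchronous processes."""
    database = []  # comparisons (sigma1, sigma2, mu)
    for _ in range(num_iterations):
        # Process 1: the policy interacts with the environment and is updated
        # to maximize the *predicted* reward r_hat(o_t, a_t).
        trajectories = run_policy(policy, env, reward_model)
        policy = update_policy(policy, trajectories, reward_model)

        # Process 2: pairs of segments are selected and sent to the human.
        for sigma1, sigma2 in select_pairs(trajectories):
            mu = ask_human(sigma1, sigma2)
            if mu is not None:  # None = "incomparable", not stored
                database.append((sigma1, sigma2, mu))

        # Process 3: r_hat is fit to all comparisons gathered so far.
        reward_model = fit_reward_model(reward_model, database)
    return policy, reward_model
```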
2.2.1 Optimizing the Policy
After using $\hat{r}$ to compute rewards, we are left with a traditional reinforcement learning problem. We can solve this problem using any RL algorithm that is appropriate for the domain. One subtlety is that the reward function $\hat{r}$ may be non-stationary, which leads us to prefer methods which are robust to changes in the reward function. This led us to focus on policy gradient methods, which have been applied successfully for such problems (Ho and Ermon, 2016).
In this paper, we use advantage actor-critic (A2C; Mnih et al., 2016) to play Atari games, and trust region policy optimization (TRPO; Schulman et al., 2015) to perform simulated robotics tasks. In each case, we used parameter settings which have been found to work well for traditional RL tasks. The only hyperparameter which we adjusted was the entropy bonus for TRPO. This is because TRPO relies on the trust region to ensure adequate exploration, which can lead to inadequate exploration if the reward function is changing.
We normalized the rewards produced by $\hat{r}$ to have zero mean and constant standard deviation. This is a typical preprocessing step which is particularly appropriate here since the position of the rewards is underdetermined by our learning problem.
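A sketch of this kind of normalization, keeping running statistics of the predicted rewards; the exact statistics and target scale used in the paper are not specified here, so treat this as one plausible implementation:

```python
import numpy as np


class RewardNormalizer:
    """Shift and scale predicted rewards to zero mean and a fixed std."""

    def __init__(self, target_std: float = 1.0, eps: float = 1e-8):
        self.history = []  # predicted rewards observed so far
        self.target_std = target_std
        self.eps = eps

    def update(self, predicted_rewards) -> None:
        self.history.extend(np.asarray(predicted_rewards).ravel().tolist())

    def normalize(self, r_hat):
        mean = np.mean(self.history) if self.history else 0.0
        std = np.std(self.history) if self.history else 1.0
        return (np.asarray(r_hat) - mean) / (std + self.eps) * self.target_std
```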
2.2.2 Preference Elicitation
The human overseer is given a visualization of two trajectory segments, in the form of short movie clips. In all of our experiments, these clips are between 1 and 2 seconds long.
The human then indicates which segment they prefer, that the two segments are equally good, or that they are unable to compare the two segments.
The human judgments are recorded in a database $\mathcal{D}$ of triples $(\sigma^1, \sigma^2, \mu)$, where $\sigma^1$ and $\sigma^2$ are the two segments and $\mu$ is a distribution over $\{1, 2\}$ indicating which segment the user preferred. If the human selects one segment as preferable, then $\mu$ puts all of its mass on that choice. If the human marks the segments as equally preferable, then $\mu$ is uniform. Finally, if the human marks the segments as incomparable, then the comparison is not included in the database.
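The mapping from a rater's response to the stored distribution $\mu$ is straightforward; a small sketch (the response strings are our own labels for the four options above):

```python
def response_to_mu(response: str):
    """Convert a rater's response into the distribution mu over {1, 2}.

    Returns None for incomparable pairs, which are not added to the database.
    """
    if response == "left":          # segment 1 preferred
        return (1.0, 0.0)
    if response == "right":         # segment 2 preferred
        return (0.0, 1.0)
    if response == "equal":         # equally preferable: mu is uniform
        return (0.5, 0.5)
    if response == "incomparable":  # comparison is discarded
        return None
    raise ValueError(f"unknown response: {response}")
```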
2.2.3 Fitting the Reward Function
We can interpret a reward function estimate $\hat{r}$ as a preference-predictor if we view $\hat{r}$ as a latent factor explaining the human's judgments and assume that the human's probability of preferring a segment $\sigma^i$ depends exponentially on the value of the latent reward summed over the length of the clip:$^{3}$

$$
\hat{P}\left[\sigma^1 \succ \sigma^2\right] = \frac{\exp \sum_t \hat{r}\left(o_t^1, a_t^1\right)}{\exp \sum_t \hat{r}\left(o_t^1, a_t^1\right) + \exp \sum_t \hat{r}\left(o_t^2, a_t^2\right)}. \qquad (1)
$$
We choose $\hat{r}$ to minimize the cross-entropy loss between these predictions and the actual human labels:

$$
\operatorname{loss}(\hat{r}) = -\sum_{\left(\sigma^1, \sigma^2, \mu\right) \in \mathcal{D}} \mu(1) \log \hat{P}\left[\sigma^1 \succ \sigma^2\right] + \mu(2) \log \hat{P}\left[\sigma^2 \succ \sigma^1\right].
$$
This follows the Bradley-Terry model (Bradley and Terry, 1952) for estimating score functions from pairwise preferences, and is the specialization of the Luce-Shephard choice rule (Luce, 2005; Shepard, 1957) to preferences over trajectory segments. It can be understood as equating rewards with a preference ranking scale analogous to the famous Elo ranking system developed for chess (Elo, 1978). Just as the difference in Elo points of two chess players estimates the probability of one player defeating the other in a game of chess, the difference in predicted reward of two trajectory segments estimates the probability that one is chosen over the other by the human.
Our actual algorithm incorporates a number of modifications to this basic approach, which early experiments discovered to be helpful and which are analyzed in Section 3.3:
We fit an ensemble of predictors, each trained on $|\mathcal{D}|$ triples sampled from $\mathcal{D}$ with replacement. The estimate $\hat{r}$ is defined by independently normalizing each of these predictors and then averaging the results.
A fraction of $1/e$ of the data is held out to be used as a validation set for each predictor. We use $\ell_2$ regularization and adjust the regularization coefficient to keep the validation loss between 1.1 and 1.5 times the training loss. In some domains we also apply dropout for regularization.
Rather than applying a softmax directly as described in Equation 1, we assume there is a 10% chance that the human responds uniformly at random. Conceptually this adjustment is needed because human raters have a constant probability of making an error, which doesn't decay to 0 as the difference in reward becomes extreme.
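A minimal NumPy sketch of this predictor and loss, assuming $\hat{r}$ has already been summed over each clip so that every comparison contributes two scalars; the constant rater-error assumption appears as a fixed mixture with a uniform response:

```python
import numpy as np


def preference_probability(sum_r1: float, sum_r2: float,
                           eps_random: float = 0.1) -> float:
    """P_hat[sigma1 > sigma2]: softmax over summed predicted rewards
    (Equation 1), mixed with a 10% chance of a uniformly random response."""
    m = max(sum_r1, sum_r2)  # subtract max for numerical stability
    e1, e2 = np.exp(sum_r1 - m), np.exp(sum_r2 - m)
    p = e1 / (e1 + e2)
    return (1.0 - eps_random) * p + eps_random * 0.5


def preference_loss(comparisons) -> float:
    """Cross-entropy between predicted preferences and human labels.

    comparisons: iterable of (sum_r1, sum_r2, mu), where sum_r1 and sum_r2 are
    predicted rewards summed over each clip and mu = (mu1, mu2) is the label
    distribution over {1, 2}.
    """
    loss = 0.0
    for sum_r1, sum_r2, (mu1, mu2) in comparisons:
        p1 = preference_probability(sum_r1, sum_r2)
        p2 = 1.0 - p1
        loss -= mu1 * np.log(p1) + mu2 * np.log(p2)
    return loss
```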
2.2.4 Selecting Queries
We decide how to query preferences based on an approximation to the uncertainty in the reward function estimator, similar to Daniel et al. (2014): we sample a large number of pairs of trajectory segments of length $k$, use each reward predictor in our ensemble to predict which segment will be preferred from each pair, and then select those trajectories for which the predictions have the highest variance across ensemble members. This is a crude approximation and the ablation experiments in Section 3 show that in some tasks it actually impairs performance. Ideally, we would want to query based on the expected value of information of the query (Akrour et al., 2012; Krueger et al., 2016), but we leave it to future work to explore this direction further.
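A sketch of this disagreement-based selection, assuming each ensemble member has already produced a predicted preference probability for every candidate pair (the array layout is our own choice):

```python
import numpy as np


def select_queries(candidate_pairs, ensemble_predictions, num_queries: int):
    """Pick the candidate pairs whose predicted preference is most disputed.

    candidate_pairs: list of (sigma1, sigma2) segment pairs.
    ensemble_predictions: array of shape (num_members, num_pairs), each
        member's predicted probability that sigma1 is preferred.
    Returns the num_queries pairs with the highest variance across members.
    """
    variance = np.var(np.asarray(ensemble_predictions), axis=0)
    most_disputed = np.argsort(-variance)[:num_queries]
    return [candidate_pairs[i] for i in most_disputed]
```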
3 Experimental Results
We implemented our algorithm in TensorFlow (Abadi et al., 2016). We interface with MuJoCo (Todorov et al., 2012) and the Arcade Learning Environment (Bellemare et al., 2013) through the OpenAI Gym (Brockman et al., 2016).
3.1 Reinforcement Learning Tasks with Unobserved Rewards
In our first set of experiments, we attempt to solve a range of benchmark tasks for deep RL without observing the true reward. Instead, the agent learns about the goal of the task only by asking a human which of two trajectory segments is better. Our goal is to solve the task in a reasonable amount of time using as few queries as possible.
In our experiments, feedback is provided by contractors who are given a 1-2 sentence description of each task before being asked to compare several hundred to several thousand pairs of trajectory segments for that task (see Appendix B for the exact instructions given to contractors). Each trajectory segment is between 1 and 2 seconds long. Contractors responded to the average query in 3-5 seconds, and so the experiments involving real human feedback required between 30 minutes and 5 hours of human time.
For comparison, we also run experiments using a synthetic oracle whose preferences over trajectories exactly reflect reward in the underlying task. That is, when the agent queries for a comparison, instead of sending the query to a human, we immediately reply by indicating a preference for whichever trajectory segment actually receives a higher reward in the underlying task.$^{4}$ We also compare to the baseline of RL training using the real reward. Our aim here is not to outperform but rather to do nearly as well as RL without access to reward information and instead relying on much scarcer feedback. Nevertheless, note that feedback from real humans does have the potential to outperform RL (and as shown below it actually does so on some tasks), because the human feedback might provide a better-shaped reward.
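A sketch of such an oracle, assuming the true reward summed over each segment is available (how ties are handled is our own assumption):

```python
def synthetic_oracle(true_return_1: float, true_return_2: float):
    """Preference label from an oracle that observes the underlying task reward.

    Arguments are the true reward summed over each trajectory segment.
    Returns mu, the label distribution over {1, 2}, preferring whichever
    segment received the higher true reward.
    """
    if true_return_1 > true_return_2:
        return (1.0, 0.0)
    if true_return_2 > true_return_1:
        return (0.0, 1.0)
    return (0.5, 0.5)  # ties treated as equally preferable
```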
We describe the details of our experiments in Appendix A, including model architectures, modifications to the environment, and the RL algorithms used to optimize the policy.
3.1.1 Simulated Robotics
The first tasks we consider are eight simulated robotics tasks, implemented in MuJoCo (Todorov et al., 2012), and included in OpenAI Gym (Brockman et al., 2016). We made small modifications to these tasks in order to avoid encoding information about the task in the environment itself (the modifications are described in detail in Appendix A). The reward functions in these tasks are linear functions of distances, positions and velocities, and all are a quadratic function of the features. We included a simple cartpole task ("pendulum") for comparison, since this is representative of the complexity of tasks studied in prior work.
Figure 2 shows the results of training our agent with 700 queries to a human rater, compared to learning from 350, 700, or 1400 synthetic queries, as well as to RL learning from the real reward.
Figure 2: Results on MuJoCo simulated robotics as measured on the tasks' true reward. We compare our method using real human feedback (purple), our method using synthetic feedback provided by an oracle (shades of blue), and reinforcement learning using the true reward function (orange). All curves are the average of 5 runs, except for the real human feedback, which is a single run, and each point is the average reward over five consecutive batches. For Reacher and Cheetah feedback was provided by an author due to time constraints. For all other tasks, feedback was provided by contractors unfamiliar with the environments and with our algorithm. The irregular progress on Hopper is due to one contractor deviating from the typical labeling schedule.
With 700 labels we are able to nearly match reinforcement learning on all of these tasks. Training with learned reward functions tends to be less stable and higher variance, while having a comparable mean performance.
Surprisingly, by 1400 labels our algorithm performs slightly better than if it had simply been given the true reward, perhaps because the learned reward function is slightly better shaped: the reward learning procedure assigns positive rewards to all behaviors that are typically followed by high reward.
Real human feedback is typically only slightly less effective than the synthetic feedback; depending on the task human feedback ranged from being half as efficient as ground truth feedback to being equally efficient. On the Ant task the human feedback significantly outperformed the synthetic feedback, apparently because we asked humans to prefer trajectories where the robot was "standing upright," which proved to be useful reward shaping. (There was a similar bonus in the RL reward function to encourage the robot to remain upright, but the simple hand-crafted bonus was not as useful.)
3.1.2 Atari
The second set of tasks we consider is a set of seven Atari games in the Arcade Learning Environment (Bellemare et al., 2013), the same games presented in Mnih et al. (2013).
Figure 3 shows the results of training our agent with 5,500 queries to a human rater, compared to learning from 350, 700, or 1400 synthetic queries, as well as to RL learning from the real reward. Our method has more difficulty matching RL in these challenging environments, but nevertheless it displays substantial learning on most of them and matches or even exceeds RL on some. Specifically, on BeamRider and Pong, synthetic labels match or come close to RL even with only 3,300 such labels. On Seaquest and Qbert synthetic feedback eventually performs near the level of RL but learns more slowly. On SpaceInvaders and Breakout synthetic feedback never matches RL, but nevertheless the agent improves substantially, often passing the first level in SpaceInvaders and reaching a score of 20 on Breakout, or 50 with enough labels.
Figure 3: Results on Atari games as measured on the tasks' true reward. We compare our method using real human feedback (purple), our method using synthetic feedback provided by an oracle (shades of blue), and reinforcement learning using the true reward function (orange). All curves are the average of 3 runs, except for the real human feedback which is a single run, and each point is the average reward over about 150,000 consecutive frames.
Figure 4: Four frames from a single backflip. The agent is trained to perform a sequence of backflips, landing upright each time. The video is available at this link.
On most of the games real human feedback performs similar to or slightly worse than synthetic feedback with the same number of labels, and often comparably to synthetic feedback that has 40% fewer labels. This may be due to human error in labeling, inconsistency between different contractors labeling the same run, or the uneven rate of labeling by contractors, which can cause labels to be overly concentrated in narrow parts of state space. The latter problems could potentially be addressed by future improvements to the pipeline for outsourcing labels. On Qbert, our method fails to learn to beat the first level with real human feedback; this may be because short clips in Qbert can be confusing and difficult to evaluate. Finally, Enduro is difficult for A3C to learn due to the difficulty of successfully passing other cars through random exploration, and is correspondingly difficult to learn with synthetic labels, but human labelers tend to reward any progress towards passing cars, essentially shaping the reward and thus outperforming A3C in this game (the results are comparable to those achieved with DQN).
3.2 Novel behaviors
Experiments with traditional RL tasks help us understand whether our method is effective, but the ultimate purpose of human interaction is to solve tasks for which no reward function is available.
Using the same parameters as in the previous experiments, we show that our algorithm can learn novel complex behaviors. We demonstrate:
1. The Hopper robot performing a sequence of backflips (see Figure 4). This behavior was trained using 900 queries in less than an hour. The agent learns to consistently perform a backflip, land upright, and repeat.
Figure 5: Performance of our algorithm on MuJoCo tasks after removing various components, as described in Section 3.3. All graphs are averaged over 5 runs, using 700 synthetic labels each.
2. The Half-Cheetah robot moving forward while standing on one leg. This behavior was trained using 800 queries in under an hour.
3. Keeping alongside other cars in Enduro. This was trained with roughly 1,300 queries and 4 million frames of interaction with the environment; the agent learns to stay almost exactly even with other moving cars for a substantial fraction of the episode, although it gets confused by changes in background.
Videos of these behaviors can be found at this link. These behaviors were trained using feedback from the authors.
3.3 Ablation Studies
In order to better understand the performance of our algorithm, we consider a range of modifications:
We pick queries uniformly at random rather than prioritizing queries for which there is disagreement (random queries).
We train only one predictor rather than an ensemble (no ensemble). In this setting, we also choose queries at random, since there is no longer an ensemble that we could use to estimate disagreement.
We train on queries only gathered at the beginning of training, rather than gathered throughout training (no online queries).
We remove the $\ell_2$ regularization and use only dropout (no regularization).
On the robotics tasks only, we use trajectory segments of length 1 (no segments).
Rather than fitting $\hat{r}$ using comparisons, we consider an oracle which provides the true total reward over a trajectory segment, and fit $\hat{r}$ to these total rewards using mean squared error (target).
The results are presented in Figure 5 for MuJoCo and Figure 6 for Atari.
Of particular interest is the poor performance of offline reward predictor training; here we find that due to the nonstationarity of the occupancy distribution, the predictor captures only part of the true reward, and maximizing this partial reward can lead to bizarre behavior that is undesirable as measured by the true reward (Amodei et al., 2016). For instance, on Pong offline training sometimes leads our agent to avoid losing points but not to score points; this can result in extremely long volleys
Figure 6: Performance of our algorithm on Atari tasks after removing various components, as described in Section 3.3. All curves are an average of 3 runs using 5,500 synthetic labels (see minor exceptions in Section A.2).
that repeat the same sequence of events ad infinitum (videos at this link). This type of behavior demonstrates that in general human feedback needs to be intertwined with RL learning rather than provided statically.
Our main motivation for eliciting comparisons rather than absolute scores was that we found it much easier for humans to provide consistent comparisons than consistent absolute scores, especially on the continuous control tasks and on the qualitative tasks in Section 3.2. Nevertheless, it seems important to understand how using comparisons affects performance. For continuous control tasks we found that predicting comparisons worked much better than predicting scores. This is likely because the scale of rewards varies substantially and this complicates the regression problem, which is smoothed significantly when we only need to predict comparisons. In the Atari tasks we clipped rewards and effectively only predicted the sign, avoiding these difficulties (this is not a suitable solution for the continuous control tasks because the relative magnitude of the reward is important to learning). In these tasks comparisons and targets had significantly different performance, but neither consistently outperformed the other.
We also observed large performance differences when using single frames rather than clips.$^{5}$ In order to obtain the same results using single frames we would need to have collected significantly more comparisons. In general we discovered that asking humans to compare longer clips was significantly more helpful per clip, and significantly less helpful per frame. We found that for short clips it took human raters a while just to understand the situation, while for longer clips the evaluation time was a roughly linear function of the clip length. We tried to choose the shortest clip length for which the evaluation time was linear. In the Atari environments we also found that it was often easier to compare longer clips because they provide more context than single frames.
4 Discussion and Conclusions
Agent-environment interactions are often radically cheaper than human interaction. We show that by learning a separate reward model using supervised learning, it is possible to reduce the interaction complexity by roughly 3 orders of magnitude. Not only does this show that we can meaningfully train deep RL agents from human preferences, but also that we are already hitting diminishing returns on further sample-complexity improvements, because the cost of compute is already comparable to the cost of non-expert feedback.$^{6}$
Although there is a large literature on preference elicitation and reinforcement learning from unknown reward functions, we provide the first evidence that these techniques can be economically scaled up to state-of-the-art reinforcement learning systems. This represents a step towards practical applications of deep RL to complex real-world tasks.
Future work may be able to improve the efficiency of learning from human preferences, and expand the range of tasks to which it can be applied.
In the long run it would be desirable to make learning a task from human preferences no more difficult than learning it from a programmatic reward signal, ensuring that powerful RL systems can be applied in the service of complex human values rather than low-complexity goals.
Acknowledgments
We thank Olivier Pietquin, Bilal Piot, Laurent Orseau, Pedro Ortega, Victoria Krakovna, Owain Evans, Andrej Karpathy, Igor Mordatch, and Jack Clark for reading drafts of the paper. We thank Tyler Adkisson, Mandy Beri, Jessica Richards, Heather Tran, and other contractors for providing the data used to train our agents. Finally, we thank OpenAI and DeepMind for providing a supportive research environment and for supporting and encouraging this collaboration.
References
Martin Abadi et al. Tensorflow: Large-scale machine learning on heterogeneous distributed systems. arXiv preprint arXiv:1603.04467, 2016.
Riad Akrour, Marc Schoenauer, and Michele Sebag. Preference-based policy learning. Machine learning and knowledge discovery in databases, pages 12-27, 2011.
Riad Akrour, Marc Schoenauer, and Michèle Sebag. April: Active preference learning-based reinforcement learning. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 116-131, 2012.
Riad Akrour, Marc Schoenauer, Michèle Sebag, and Jean-Christophe Souplet. Programming by feedback. In International Conference on Machine Learning, pages 1503-1511, 2014.
Dario Amodei, Chris Olah, Jacob Steinhardt, Paul Christiano, John Schulman, and Dan Mané. Concrete problems in AI safety. arXiv preprint arXiv:1606.06565, 2016.
Marc G Bellemare, Yavar Naddaf, Joel Veness, and Michael Bowling. The Arcade Learning Environment: An evaluation platform for general agents. Journal of Artificial Intelligence Research, 47:253-279, 2013.
Nick Bostrom. Superintelligence: Paths, Dangers, Strategies. Oxford University Press, 2014.
Ralph Allan Bradley and Milton E Terry. Rank analysis of incomplete block designs: I. The method of paired comparisons. Biometrika, 39(3/4):324-345, 1952.
Eric Brochu, Tyson Brochu, and Nando de Freitas. A bayesian interactive optimization approach to procedural animation design. In Proceedings of the 2010 ACM SIGGRAPH/Eurographics Symposium on Computer Animation, pages 103-112. Eurographics Association, 2010.
Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba. OpenAI Gym. arXiv preprint arXiv:1606.01540, 2016.
Christian Daniel, Malte Viering, Jan Metz, Oliver Kroemer, and Jan Peters. Active reward learning. In Robotics: Science and Systems, 2014.
Christian Daniel, Oliver Kroemer, Malte Viering, Jan Metz, and Jan Peters. Active reward learning with a novel acquisition function. Autonomous Robots, 39(3):389-405, 2015.
Layla El Asri, Bilal Piot, Matthieu Geist, Romain Laroche, and Olivier Pietquin. Score-based inverse reinforcement learning. In International Conference on Autonomous Agents and Multiagent Systems, pages 457-465, 2016.
Arpad Elo. The Rating of Chessplayers, Past and Present. Arco Pub., 1978.
Chelsea Finn, Sergey Levine, and Pieter Abbeel. Guided cost learning: Deep inverse optimal control via policy optimization. In International Conference on Machine Learning, volume 48, 2016.
Chelsea Finn, Tianhe Yu, Justin Fu, Pieter Abbeel, and Sergey Levine. Generalizing skills with semi-supervised reinforcement learning. In International Conference on Learning Representations, 2017.
Johannes Fürnkranz, Eyke Hüllermeier, Weiwei Cheng, and Sang-Hyeun Park. Preference-based reinforcement learning: A formal framework and a policy iteration algorithm. Machine learning, 89(1-2):123-156, 2012.
Dylan Hadfield-Menell, Stuart Russell, Pieter Abbeel, and Anca Dragan. Cooperative inverse reinforcement learning. In Advances in Neural Information Processing Systems, pages 3909-3917, 2016.
Todd Hester, Matej Vecerik, Olivier Pietquin, Marc Lanctot, Tom Schaul, Bilal Piot, Andrew Sendonaris, Gabriel Dulac-Arnold, Ian Osband, John Agapiou, Joel Z Leibo, and Audrunas Gruslys. Learning from demonstrations for real world reinforcement learning. arXiv preprint arXiv:1704.03732, 2017.
Jonathan Ho and Stefano Ermon. Generative adversarial imitation learning. In Advances in Neural Information Processing Systems, pages 4565-4573, 2016.
W Bradley Knox and Peter Stone. Interactively shaping agents via human reinforcement: The TAMER framework. In International Conference on Knowledge Capture, pages 9-16, 2009.
W. Bradley Knox and Peter Stone. Learning non-myopically from human-generated reward. In Jihie Kim, Jeffrey Nichols, and Pedro A. Szekely, editors, IUI, pages 191-202. ACM, 2013. ISBN 978-1-4503-1965-2. URL http://doi.acm.org/10.1145/2449396.
William Bradley Knox. Learning from human-generated reward. PhD thesis, University of Texas at Austin, 2012.
David Krueger, Jan Leike, Owain Evans, and John Salvatier. Active reinforcement learning: Observing rewards at a cost. In Future of Interactive Learning Machines, NIPS Workshop, 2016.
R Duncan Luce. Individual choice behavior: A theoretical analysis. Courier Corporation, 2005.
James MacGlashan, Mark K Ho, Robert Loftin, Bei Peng, David Roberts, Matthew E Taylor, and Michael L Littman. Interactive learning from policy-dependent human feedback. arXiv preprint arXiv:1701.06049, 2017.
AT Machwe and IC Parmee. Introducing machine learning within an interactive evolutionary design environment. In DS 36: Proceedings DESIGN 2006, the 9th International Design Conference, Dubrovnik, Croatia, 2006.
Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller. Playing Atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602, 2013.
Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, Stig Petersen, Charles Beattie, Amir Sadik, Ioannis Antonoglou, Helen King, Dharshan Kumaran, Daan Wierstra, Shane Legg, and Demis Hassabis. Human-level control through deep reinforcement learning. Nature, 518(7540):529-533, 2015.
Volodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza, Alex Graves, Timothy Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. Asynchronous methods for deep reinforcement learning. In International Conference on Machine Learning, pages 1928-1937, 2016.
Andrew Y Ng and Stuart Russell. Algorithms for inverse reinforcement learning. In International Conference on Machine learning, pages 663-670, 2000.
Patrick M Pilarski, Michael R Dawson, Thomas Degris, Farbod Fahimi, Jason P Carey, and Richard Sutton. Online human training of a myoelectric prosthesis controller via actor-critic reinforcement learning. In International Conference on Rehabilitation Robotics, pages 1-7, 2011.
Stuart Russell. Should we fear supersmart robots? Scientific American, 314(6):58, 2016.
John Schulman, Sergey Levine, Pieter Abbeel, Michael I Jordan, and Philipp Moritz. Trust region policy optimization. In International Conference on Machine Learning, pages 1889-1897, 2015.
Jimmy Secretan, Nicholas Beato, David B D Ambrosio, Adelein Rodriguez, Adam Campbell, and Kenneth O Stanley. Picbreeder: Evolving pictures collaboratively online. In Conference on Human Factors in Computing Systems, pages 1759-1768, 2008.
Roger N Shepard. Stimulus and response generalization: A stochastic model relating generalization to distance in psychological space. Psychometrika, 22(4):325-345, 1957.
David Silver, Aja Huang, Chris J Maddison, Arthur Guez, Laurent Sifre, George Van Den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, Sander Dieleman, Dominik Grewe, John Nham, Nal Kalchbrenner, Ilya Sutskever, Timothy Lillicrap, Madeleine Leach, Koray Kavukcuoglu, Thore Graepel, and Demis Hassabis. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484-489, 2016.
Patrikk D Sørensen, Jeppeh M Olsen, and Sebastian Risi. Breeding a diversity of super mario behaviors through interactive evolution. In Computational Intelligence and Games (CIG), 2016 IEEE Conference on, pages 1-7. IEEE, 2016.
Bradly C Stadie, Pieter Abbeel, and Ilya Sutskever. Third-person imitation learning. In International Conference on Learning Representations, 2017.
Hiroaki Sugiyama, Toyomi Meguro, and Yasuhiro Minami. Preference-learning based inverse reinforcement learning for dialog control. In INTERSPEECH, pages 222-225, 2012.
Emanuel Todorov, Tom Erez, and Yuval Tassa. Mujoco: A physics engine for model-based control. In International Conference on Intelligent Robots and Systems, pages 5026-5033, 2012.
Sida I Wang, Percy Liang, and Christopher D Manning. Learning language games through interaction. arXiv preprint arXiv:1606.02447, 2016.
Aaron Wilson, Alan Fern, and Prasad Tadepalli. A Bayesian approach for policy learning from trajectory preference queries. In Advances in Neural Information Processing Systems, pages 1133-1141, 2012.
Christian Wirth and Johannes Fürnkranz. Preference-based reinforcement learning: A preliminary survey. In ECML/PKDD Workshop on Reinforcement Learning from Generalized Feedback: Beyond Numeric Rewards, 2013.
Christian Wirth, J Fürnkranz, Gerhard Neumann, et al. Model-free preference-based reinforcement learning. In AAAI, pages 2222-2228, 2016.