这是用户在 2024-3-24 18:14 为 https://ar5iv.labs.arxiv.org/html/1812.02900 保存的双语快照页面,由 沉浸式翻译 提供双语支持。了解如何保存?

Off-Policy Deep Reinforcement Learning without Exploration

Scott Fujimoto  斯科特·藤本    David Meger  大卫·梅格    Doina Precup  多伊娜·普雷库普

Off-Policy Deep Reinforcement Learning without Exploration:
Supplementary Material

Scott Fujimoto  斯科特·藤本    David Meger  大卫·梅格    Doina Precup  多伊娜·普雷库普
Abstract 摘要

Many practical applications of reinforcement learning constrain agents to learn from a fixed batch of data which has already been gathered, without offering further possibility for data collection. In this paper, we demonstrate that due to errors introduced by extrapolation, standard off-policy deep reinforcement learning algorithms, such as DQN and DDPG, are incapable of learning without data correlated to the distribution under the current policy, making them ineffective for this fixed batch setting. We introduce a novel class of off-policy algorithms, batch-constrained reinforcement learning, which restricts the action space in order to force the agent towards behaving close to on-policy with respect to a subset of the given data. We present the first continuous control deep reinforcement learning algorithm which can learn effectively from arbitrary, fixed batch data, and empirically demonstrate the quality of its behavior in several tasks.
许多强化学习的实际应用都限制代理从已经收集的固定数据批次中学习,而不提供进一步的数据收集可能性。本文证明,由于外推引入的错误,标准的离策略深度强化学习算法,如 DQN 和 DDPG,无法学习没有与当前策略下的分布相关的数据,使它们在这种固定批次设置下无效。我们引入了一种新颖的离策略算法类别,批次约束强化学习,它通过限制动作空间来迫使代理在行为上接近于对给定数据子集的策略。我们提出了第一个可以有效地从任意固定批次数据中学习的连续控制深度强化学习算法,并在几个任务中经验性地展示了其行为的质量。

Machine Learning, ICML, Deep Reinforcement Learning, Imitation Learning, Batch Reinforcement Learning, Off-Policy

1 Introduction 介绍

Batch reinforcement learning, the task of learning from a fixed dataset without further interactions with the environment, is a crucial requirement for scaling reinforcement learning to tasks where the data collection procedure is costly, risky, or time-consuming. Off-policy batch reinforcement learning has important implications for many practical applications. It is often preferable for data collection to be performed by some secondary controlled process, such as a human operator or a carefully monitored program. If assumptions on the quality of the behavioral policy can be made, imitation learning can be used to produce strong policies. However, most imitation learning algorithms are known to fail when exposed to suboptimal trajectories, or require further interactions with the environment to compensate (Hester et al., 2017; Sun et al., 2018; Cheng et al., 2018). On the other hand, batch reinforcement learning offers a mechanism for learning from a fixed dataset without restrictions on the quality of the data.

Most modern off-policy deep reinforcement learning algorithms fall into the category of growing batch learning (Lange et al., 2012), in which data is collected and stored into an experience replay dataset (Lin, 1992), which is used to train the agent before further data collection occurs. However, we find that these “off-policy” algorithms can fail in the batch setting, becoming unsuccessful if the dataset is uncorrelated to the true distribution under the current policy. Our most surprising result shows that off-policy agents perform dramatically worse than the behavioral agent when trained with the same algorithm on the same dataset.
大多数现代离策略深度强化学习算法属于不断增长的批量学习类别(Lange 等,2012 年),其中数据被收集并存储到经验重放数据集(Lin,1992 年)中,该数据集用于在进一步数据收集之前训练代理。然而,我们发现这些“离策略”算法在批量设置中可能会失败,如果数据集与当前策略下的真实分布不相关,则会失败。我们最令人惊讶的结果表明,当使用相同算法在相同数据集上训练时,离策略代理的表现明显比行为代理差。

This inability to learn truly off-policy is due to a fundamental problem with off-policy reinforcement learning we denote extrapolation error, a phenomenon in which unseen state-action pairs are erroneously estimated to have unrealistic values. Extrapolation error can be attributed to a mismatch in the distribution of data induced by the policy and the distribution of data contained in the batch. As a result, it may be impossible to learn a value function for a policy which selects actions not contained in the batch.

To overcome extrapolation error in off-policy learning, we introduce batch-constrained reinforcement learning, where agents are trained to maximize reward while minimizing the mismatch between the state-action visitation of the policy and the state-action pairs contained in the batch. Our deep reinforcement learning algorithm, Batch-Constrained deep Q-learning (BCQ), uses a state-conditioned generative model to produce only previously seen actions. This generative model is combined with a Q-network, to select the highest valued action which is similar to the data in the batch. Under mild assumptions, we prove this batch-constrained paradigm is necessary for unbiased value estimation from incomplete datasets for finite deterministic MDPs.
为了克服离线学习中的外推误差,我们引入了批量约束强化学习,其中代理被训练以最大化奖励,同时最小化策略中的状态-动作访问与批量中包含的状态-动作对之间的不匹配。我们的深度强化学习算法,批量约束深度 Q 学习(BCQ),使用状态条件生成模型仅生成先前见过的动作。这个生成模型与 Q 网络结合,选择价值最高的动作,该动作与批量数据中的数据相似。在温和的假设下,我们证明这种批量约束范式对于有限确定性 MDP 的不完整数据集的无偏值估计是必要的。

Unlike any previous continuous control deep reinforcement learning algorithms, BCQ is able to learn successfully without interacting with the environment by considering extrapolation error. Our algorithm is evaluated on a series of batch reinforcement learning tasks in MuJoCo environments (Todorov et al., 2012; Brockman et al., 2016), where extrapolation error is particularly problematic due to the high-dimensional continuous action space, which is impossible to sample exhaustively. Our algorithm offers a unified view on imitation and off-policy learning, and is capable of learning from purely expert demonstrations, as well as from finite batches of suboptimal data, without further exploration. We remark that BCQ is only one way to approach batch-constrained reinforcement learning in a deep setting, and we hope that it will be serve as a foundation for future algorithms. To ensure reproducibility, we provide precise experimental and implementation details, and our code is made available (https://github.com/sfujim/BCQ).
与以往任何连续控制深度强化学习算法不同,BCQ 能够成功学习而无需与环境交互,考虑外推误差。我们的算法在 MuJoCo 环境中的一系列批量强化学习任务中进行评估(Todorov 等,2012 年;Brockman 等,2016 年),其中由于高维连续动作空间,外推误差尤为棘手,无法进行全面采样。我们的算法提供了对模仿和离策略学习的统一视角,并能够从纯专家演示中学习,也能够从有限批次的次优数据中学习,无需进一步探索。我们指出,BCQ 只是在深度设置中处理批量约束强化学习的一种方式,希望它能成为未来算法的基础。为了确保可复现性,我们提供了精确的实验和实现细节,并提供了我们的代码(https://github.com/sfujim/BCQ)。

2 Background 背景

In reinforcement learning, an agent interacts with its environment, typically assumed to be a Markov decision process (MDP) (𝒮,𝒜,pM,r,γ)𝒮𝒜subscript𝑝𝑀𝑟𝛾(\mathcal{S},\mathcal{A},p_{M},r,\gamma), with state space 𝒮𝒮\mathcal{S}, action space 𝒜𝒜\mathcal{A}, and transition dynamics pM(s|s,a)subscript𝑝𝑀conditionalsuperscript𝑠𝑠𝑎p_{M}(s^{\prime}|s,a). At each discrete time step, the agent receives a reward r(s,a,s)𝑟𝑠𝑎superscript𝑠r(s,a,s^{\prime})\in\mathbb{R} for performing action a𝑎a in state s𝑠s and arriving at the state ssuperscript𝑠s^{\prime}. The goal of the agent is to maximize the expectation of the sum of discounted rewards, known as the return Rt=i=t+1γir(si,ai,si+1)subscript𝑅𝑡superscriptsubscript𝑖𝑡1superscript𝛾𝑖𝑟subscript𝑠𝑖subscript𝑎𝑖subscript𝑠𝑖1R_{t}=\sum_{i={t+1}}^{\infty}\gamma^{i}r(s_{i},a_{i},s_{i+1}), which weighs future rewards with respect to the discount factor γ[0,1)𝛾01\gamma\in[0,1).

The agent selects actions with respect to a policy π:𝒮𝒜:𝜋𝒮𝒜\pi:\mathcal{S}\rightarrow\mathcal{A}, which exhibits a distribution μπ(s)superscript𝜇𝜋𝑠\mu^{\pi}(s) over the states s𝒮𝑠𝒮s\in\mathcal{S} visited by the policy. Each policy π𝜋\pi has a corresponding value function Qπ(s,a)=𝔼π[Rt|s,a]superscript𝑄𝜋𝑠𝑎subscript𝔼𝜋delimited-[]conditionalsubscript𝑅𝑡𝑠𝑎Q^{\pi}(s,a)=\mathbb{E}_{\pi}[R_{t}|s,a], the expected return when following the policy after taking action a𝑎a in state s𝑠s. For a given policy π𝜋\pi, the value function can be computed through the Bellman operator 𝒯πsuperscript𝒯𝜋\mathcal{T}^{\pi}:
代理根据策略 π:𝒮𝒜:𝜋𝒮𝒜\pi:\mathcal{S}\rightarrow\mathcal{A} 选择动作,该策略在被访问的状态 s𝒮𝑠𝒮s\in\mathcal{S} 上展现出一个分布 μπ(s)superscript𝜇𝜋𝑠\mu^{\pi}(s) 。每个策略 π𝜋\pi 都有一个相应的值函数 Qπ(s,a)=𝔼π[Rt|s,a]superscript𝑄𝜋𝑠𝑎subscript𝔼𝜋delimited-[]conditionalsubscript𝑅𝑡𝑠𝑎Q^{\pi}(s,a)=\mathbb{E}_{\pi}[R_{t}|s,a] ,即在状态 s𝑠s 中采取动作 a𝑎a 后遵循该策略的预期回报。对于给定的策略 π𝜋\pi ,值函数可以通过贝尔曼算子 𝒯πsuperscript𝒯𝜋\mathcal{T}^{\pi} 计算:

𝒯πQ(s,a)=𝔼s[r+γQ(s,π(s))].superscript𝒯𝜋𝑄𝑠𝑎subscript𝔼superscript𝑠delimited-[]𝑟𝛾𝑄superscript𝑠𝜋superscript𝑠\mathcal{T}^{\pi}Q(s,a)=\mathbb{E}_{s^{\prime}}[r+\gamma Q(s^{\prime},\pi(s^{\prime}))]. (1)

The Bellman operator is a contraction for γ[0,1)𝛾01\gamma\in[0,1) with unique fixed point Qπ(s,a)superscript𝑄𝜋𝑠𝑎Q^{\pi}(s,a) (Bertsekas & Tsitsiklis, 1996). Q(s,a)=maxπQπ(s,a)superscript𝑄𝑠𝑎subscript𝜋superscript𝑄𝜋𝑠𝑎Q^{*}(s,a)=\max_{\pi}Q^{\pi}(s,a) is known as the optimal value function, which has a corresponding optimal policy obtained through greedy action choices. For large or continuous state and action spaces, the value can be approximated with neural networks, e.g. using the Deep Q-Network algorithm (DQN) (Mnih et al., 2015). In DQN, the value function Qθsubscript𝑄𝜃Q_{\theta} is updated using the target:
贝尔曼算子是一个收缩映射,具有唯一的不动点(Bertsekas&Tsitsiklis,1996)。这个不动点被称为最优值函数,通过贪婪动作选择获得相应的最优策略。对于大型或连续的状态和动作空间,可以使用神经网络来近似值,例如使用深度 Q 网络算法(DQN)(Mnih 等,2015)。在 DQN 中,值函数通过目标进行更新:

r+γQθ(s,π(s)),π(s)=argmaxaQθ(s,a),𝑟𝛾subscript𝑄superscript𝜃superscript𝑠𝜋superscript𝑠𝜋superscript𝑠subscriptargmax𝑎subscript𝑄superscript𝜃superscript𝑠𝑎r+\gamma Q_{\theta^{\prime}}(s^{\prime},\pi(s^{\prime})),\quad\pi(s^{\prime})=\operatorname{argmax}_{a}Q_{\theta^{\prime}}(s^{\prime},a), (2)

Q-learning is an off-policy algorithm (Sutton & Barto, 1998), meaning the target can be computed without consideration of how the experience was generated. In principle, off-policy reinforcement learning algorithms are able to learn from data collected by any behavioral policy. Typically, the loss is minimized over mini-batches of tuples of the agent’s past data, (s,a,r,s)𝑠𝑎𝑟superscript𝑠(s,a,r,s^{\prime})\in\mathcal{B}, sampled from an experience replay dataset \mathcal{B} (Lin, 1992). For shorthand, we often write s𝑠s\in\mathcal{B} if there exists a transition tuple containing s𝑠s in the batch \mathcal{B}, and similarly for (s,a)𝑠𝑎(s,a) or (s,a,s)𝑠𝑎superscript𝑠(s,a,s^{\prime})\in\mathcal{B}. In batch reinforcement learning, we assume \mathcal{B} is fixed and no further interaction with the environment occurs. To further stabilize learning, a target network with frozen parameters Qθsubscript𝑄superscript𝜃Q_{\theta^{\prime}}, is used in the learning target. The parameters of the target network θsuperscript𝜃\theta^{\prime} are updated to the current network parameters θ𝜃\theta after a fixed number of time steps, or by averaging θτθ+(1τ)θsuperscript𝜃𝜏𝜃1𝜏superscript𝜃\theta^{\prime}\leftarrow\tau\theta+(1-\tau)\theta^{\prime} for some small τ𝜏\tau (Lillicrap et al., 2015).
Q-learning 是一种离策略算法(Sutton&Barto,1998),意味着目标可以在不考虑经验生成方式的情况下计算。原则上,离策略强化学习算法能够从任何行为策略收集的数据中学习。通常,损失在代理过去数据的元组小批量 (s,a,r,s)𝑠𝑎𝑟superscript𝑠(s,a,r,s^{\prime})\in\mathcal{B}

In a continuous action space, the analytic maximum of Equation (2) is intractable. In this case, actor-critic methods are commonly used, where action selection is performed through a separate policy network πϕsubscript𝜋italic-ϕ\pi_{\phi}, known as the actor, and updated with respect to a value estimate, known as the critic (Sutton & Barto, 1998; Konda & Tsitsiklis, 2003). This policy can be updated following the deterministic policy gradient theorem (Silver et al., 2014):
在连续动作空间中,方程(2)的解析最大值是难以处理的。在这种情况下,通常使用演员-评论家方法,其中动作选择通过一个单独的策略网络(演员)进行,并根据价值估计(评论家)进行更新(Sutton&Barto,1998;Konda&Tsitsiklis,2003)。该策略可以根据确定性策略梯度定理进行更新(Silver 等人,2014):

ϕargmaxϕ𝔼s[Qθ(s,πϕ(s))],italic-ϕsubscriptargmaxitalic-ϕsubscript𝔼𝑠delimited-[]subscript𝑄𝜃𝑠subscript𝜋italic-ϕ𝑠\phi\leftarrow\operatorname{argmax}_{\phi}\mathbb{E}_{s\in\mathcal{B}}[Q_{\theta}(s,\pi_{\phi}(s))], (3)

which corresponds to learning an approximation to the maximum of Qθsubscript𝑄𝜃Q_{\theta}, by propagating the gradient through both πϕsubscript𝜋italic-ϕ\pi_{\phi} and Qθsubscript𝑄𝜃Q_{\theta}. When combined with off-policy deep Q-learning to learn Qθsubscript𝑄𝜃Q_{\theta}, this algorithm is referred to as Deep Deterministic Policy Gradients (DDPG) (Lillicrap et al., 2015).
这对应于通过在 πϕsubscript𝜋italic-ϕ\pi_{\phi}Qθsubscript𝑄𝜃Q_{\theta} 中传播梯度来学习 Qθsubscript𝑄𝜃Q_{\theta} 的近似值。当与离策略深度 Q 学习相结合以学习 Qθsubscript𝑄𝜃Q_{\theta} 时,该算法被称为深度确定性策略梯度(DDPG)(Lillicrap 等,2015)。

3 Extrapolation Error 外推误差

Extrapolation error is an error in off-policy value learning which is introduced by the mismatch between the dataset and true state-action visitation of the current policy. The value estimate Q(s,a)𝑄𝑠𝑎Q(s,a) is affected by extrapolation error during a value update where the target policy selects an unfamiliar action asuperscript𝑎a^{\prime} at the next state ssuperscript𝑠s^{\prime} in the backed-up value estimate, such that (s,a)superscript𝑠superscript𝑎(s^{\prime},a^{\prime}) is unlikely, or not contained, in the dataset. The cause of extrapolation error can be attributed to several related factors:
外推误差是离线策略价值学习中的一个错误,由于数据集与当前策略的真实状态-动作访问不匹配而引入。在值更新过程中,值估计 Q(s,a)𝑄𝑠𝑎Q(s,a) 受到外推误差的影响,其中目标策略在备份的值估计中选择了下一个状态 ssuperscript𝑠s^{\prime} 的陌生动作 asuperscript𝑎a^{\prime} ,使得 (s,a)superscript𝑠superscript𝑎(s^{\prime},a^{\prime}) 不太可能或不包含在数据集中。外推误差的原因可以归因于几个相关因素:

Absent Data. If any state-action pair (s,a)𝑠𝑎(s,a) is unavailable, then error is introduced as some function of the amount of similar data and approximation error. This means that the estimate of Qθ(s,π(s))subscript𝑄𝜃superscript𝑠𝜋superscript𝑠Q_{\theta}(s^{\prime},\pi(s^{\prime})) may be arbitrarily bad without sufficient data near (s,π(s))superscript𝑠𝜋superscript𝑠(s^{\prime},\pi(s^{\prime})).
缺失数据。如果任何状态-动作对 (s,a)𝑠𝑎(s,a) 不可用,则会引入错误,该错误是某些相似数据量和近似误差的函数。这意味着在 (s,π(s))superscript𝑠𝜋superscript𝑠(s^{\prime},\pi(s^{\prime})) 附近没有足够的数据时,对 Qθ(s,π(s))subscript𝑄𝜃superscript𝑠𝜋superscript𝑠Q_{\theta}(s^{\prime},\pi(s^{\prime})) 的估计可能会任意糟糕。

Model Bias. When performing off-policy Q-learning with a batch \mathcal{B}, the Bellman operator 𝒯πsuperscript𝒯𝜋\mathcal{T}^{\pi} is approximated by sampling transitions tuples (s,a,r,s)𝑠𝑎𝑟superscript𝑠(s,a,r,s^{\prime}) from \mathcal{B} to estimate the expectation over ssuperscript𝑠s^{\prime}. However, for a stochastic MDP, without infinite state-action visitation, this produces a biased estimate of the transition dynamics:
模型偏差。在使用批处理 \mathcal{B} 执行离策略 Q 学习时,Bellman 算子 𝒯πsuperscript𝒯𝜋\mathcal{T}^{\pi} 通过从 \mathcal{B} 中采样转换元组 (s,a,r,s)𝑠𝑎𝑟superscript𝑠(s,a,r,s^{\prime}) 来近似,以估计 ssuperscript𝑠s^{\prime} 上的期望。然而,对于随机 MDP,没有无限状态-动作访问,这会产生对转换动态的有偏估计:

𝒯πQ(s,a)𝔼s[r+γQ(s,π(s))],superscript𝒯𝜋𝑄𝑠𝑎subscript𝔼similar-tosuperscript𝑠delimited-[]𝑟𝛾𝑄superscript𝑠𝜋superscript𝑠\mathcal{T}^{\pi}Q(s,a)\approx\mathbb{E}_{s^{\prime}\sim\mathcal{B}}[r+\gamma Q(s^{\prime},\pi(s^{\prime}))], (4)

where the expectation is with respect to transitions in the batch \mathcal{B}, rather than the true MDP.
期望是针对批次 \mathcal{B} 中的转换,而不是真实的 MDP。

Training Mismatch. Even with sufficient data, in deep Q-learning systems, transitions are sampled uniformly from the dataset, giving a loss weighted with respect to the likelihood of data in the batch:
训练不匹配。即使有足够的数据,在深度 Q 学习系统中,转换是从数据集中均匀抽样的,导致损失与批次数据的可能性加权。

1||(s,a,r,s)r+γQθ(s,π(s))Qθ(s,a)2.absent1subscript𝑠𝑎𝑟superscript𝑠superscriptnorm𝑟𝛾subscript𝑄superscript𝜃superscript𝑠𝜋superscript𝑠subscript𝑄𝜃𝑠𝑎2\approx\frac{1}{|\mathcal{B}|}\sum_{(s,a,r,s^{\prime})\in\mathcal{B}}||r+\gamma Q_{\theta^{\prime}}(s^{\prime},\pi(s^{\prime}))-Q_{\theta}(s,a)||^{2}. (5)

If the distribution of data in the batch does not correspond with the distribution under the current policy, the value function may be a poor estimate of actions selected by the current policy, due to the mismatch in training.

We remark that re-weighting the loss in Equation (5) with respect to the likelihood under the current policy can still result in poor estimates if state-action pairs with high likelihood under the current policy are not found in the batch. This means only a subset of possible policies can be evaluated accurately. As a result, learning a value estimate with off-policy data can result in large amounts of extrapolation error if the policy selects actions which are not similar to the data found in the batch. In the following section, we discuss how state of the art off-policy deep reinforcement learning algorithms fail to address the concern of extrapolation error, and demonstrate the implications in practical examples.

3.1 Extrapolation Error in Deep Reinforcement Learning

Deep Q-learning algorithms (Mnih et al., 2015) have been labeled as off-policy due to their connection to off-policy Q-learning (Watkins, 1989). However, these algorithms tend to use near-on-policy exploratory policies, such as ϵitalic-ϵ\epsilon-greedy, in conjunction with a replay buffer (Lin, 1992). As a result, the generated dataset tends to be heavily correlated to the current policy. In this section, we examine how these off-policy algorithms perform when learning with uncorrelated datasets. Our results demonstrate that the performance of a state of the art deep actor-critic algorithm, DDPG (Lillicrap et al., 2015), deteriorates rapidly when the data is uncorrelated and the value estimate produced by the deep Q-network diverges. These results suggest that off-policy deep reinforcement learning algorithms are ineffective when learning truly off-policy.
深度 Q 学习算法(Mnih 等,2015 年)被标记为离策略,因为它们与离策略 Q 学习(Watkins,1989 年)有关。然而,这些算法往往使用接近策略的探索策略,如ε-贪心,结合重放缓冲区(Lin,1992 年)。因此,生成的数据集往往与当前策略高度相关。在本节中,我们将研究这些离策略算法在学习与不相关数据集时的表现。我们的结果表明,一种最先进的深度 actor-critic 算法 DDPG(Lillicrap 等,2015 年)在数据不相关且由深度 Q 网络产生的值估计发散时,性能会迅速下降。这些结果表明,当真正离策略学习时,离策略深度强化学习算法是无效的。

Our practical experiments examine three different batch settings in OpenAI gym’s Hopper-v1 environment (Todorov et al., 2012; Brockman et al., 2016), which we use to train an off-policy DDPG agent with no interaction with the environment. Experiments with additional environments and specific details can be found in the Supplementary Material.
我们的实验在 OpenAI gym 的 Hopper-v1 环境中检查了三种不同的批处理设置(Todorov 等,2012 年;Brockman 等,2016 年),我们使用这些设置来训练一个无需与环境交互的离线 DDPG 代理。有关其他环境和具体细节的实验可以在补充材料中找到。

Refer to caption
Refer to caption
(a) Final buffer  最终缓冲区
performance 表现
(b) Concurrent  并发
performance 表现
(c) Imitation  模仿
performance 表现
Refer to caption
(d) Final buffer  最终缓冲区
value estimate 价值估算
(e) Concurrent  并发
value estimate 价值估算
(f) Imitation  模仿
value estimate 价值估算
Figure 1: We examine the performance (top row) and corresponding value estimates (bottom row) of DDPG in three batch tasks on Hopper-v1. Each individual trial is plotted with a thin line, with the mean in bold (evaluated without exploration noise). Straight lines represent the average return of episodes contained in the batch (with exploration noise). An estimate of the true value of the off-policy agent, evaluated by Monte Carlo returns, is marked by a dotted line. In all three experiments, we observe a large gap in the performance between the behavioral and off-policy agent, even when learning from the same dataset (concurrent). Furthermore, the value estimates are unstable or divergent across all tasks.
图 1:我们在 Hopper-v1 上对 DDPG 在三个批处理任务中的性能(顶部行)和相应的值估计(底部行)进行了检查。每个单独的试验都用一条细线绘制,均值用粗体表示(在没有探索噪声的情况下评估)。直线代表批次中包含的剧集的平均回报(带有探索噪声)。通过蒙特卡洛回报评估的离线策略代理的真实值估计由虚线标记。在所有三个实验中,我们观察到行为代理和离线策略代理之间的性能差距很大,即使从相同数据集(并发)中学习。此外,所有任务中的值估计都不稳定或发散。

Batch 1 (Final buffer). We train a DDPG agent for 1 million time steps, adding 𝒩(0,0.5)𝒩00.5\mathcal{N}(0,0.5) Gaussian noise to actions for high exploration, and store all experienced transitions. This collection procedure creates a dataset with a diverse set of states and actions, with the aim of sufficient coverage.
批次 1(最终缓冲区)。我们训练一个 DDPG 代理器进行 100 万个时间步,对动作添加高探索性的高斯噪声,并存储所有经验转换。这个收集过程创建了一个包含各种状态和动作的数据集,旨在实现充分覆盖。

Batch 2 (Concurrent). We concurrently train the off-policy and behavioral DDPG agents, for 1 million time steps. To ensure sufficient exploration, a standard 𝒩(0,0.1)𝒩00.1\mathcal{N}(0,0.1) Gaussian noise is added to actions taken by the behavioral policy. Each transition experienced by the behavioral policy is stored in a buffer replay, which both agents learn from. As a result, both agents are trained with the identical dataset.
第二批(并发)。我们同时训练离策略和行为 DDPG 代理,进行 100 万个时间步。为了确保足够的探索,标准的高斯噪声被添加到行为策略采取的动作中。行为策略经历的每个转换都存储在一个缓冲重放中,两个代理都从中学习。因此,两个代理都使用相同的数据集进行训练。

Batch 3 (Imitation). A trained DDPG agent acts as an expert, and is used to collect a dataset of 1 million transitions.
第三批(模拟)。经过训练的 DDPG 代理充当专家,并用于收集 100 万个转换的数据集。

In Figure 1, we graph the performance of the agents as they train with each batch, as well as their value estimates. Straight lines represent the average return of episodes contained in the batch. Additionally, we graph the learning performance of the behavioral agent for the relevant tasks.
在图 1 中,我们绘制了代理随着每个批次训练的表现,以及它们的价值估计。直线代表批次中包含的剧集的平均回报。此外,我们还绘制了行为代理在相关任务中的学习表现。

Our experiments demonstrate several surprising facts about off-policy deep reinforcement learning agents. In each task, the off-policy agent performances significantly worse than the behavioral agent. Even in the concurrent experiment, where both agents are trained with the same dataset, there is a large gap in performance in every single trial. This result suggests that differences in the state distribution under the initial policies is enough for extrapolation error to drastically offset the performance of the off-policy agent. Additionally, the corresponding value estimate exhibits divergent behavior, while the value estimate of the behavioral agent is highly stable. In the final buffer experiment, the off-policy agent is provided with a large and diverse dataset, with the aim of providing sufficient coverage of the initial policy. Even in this instance, the value estimate is highly unstable, and the performance suffers. In the imitation setting, the agent is provided with expert data. However, the agent quickly learns to take non-expert actions, under the guise of optimistic extrapolation. As a result, the value estimates rapidly diverge and the agent fails to learn.

Although extrapolation error is not necessarily positively biased, when combined with maximization in reinforcement learning algorithms, extrapolation error provides a source of noise that can induce a persistent overestimation bias (Thrun & Schwartz, 1993; Van Hasselt et al., 2016; Fujimoto et al., 2018). In an on-policy setting, extrapolation error may be a source of beneficial exploration through an implicit “optimism in the face of uncertainty” strategy (Lai & Robbins, 1985; Jaksch et al., 2010). In this case, if the value function overestimates an unknown state-action pair, the policy will collect data in the region of uncertainty, and the value estimate will be corrected. However, when learning off-policy, or in a batch setting, extrapolation error will never be corrected due to the inability to collect new data.

These experiments show extrapolation error can be highly detrimental to learning off-policy in a batch reinforcement learning setting. While the continuous state space and multi-dimensional action space in MuJoCo environments are contributing factors to extrapolation error, the scale of these tasks is small compared to real world settings. As a result, even with a sufficient amount of data collection, extrapolation error may still occur due to the concern of catastrophic forgetting (McCloskey & Cohen, 1989; Goodfellow et al., 2013). Consequently, off-policy reinforcement learning algorithms used in the real-world will require practical guarantees without exhaustive amounts of data.
这些实验表明,在批量强化学习设置中,外推误差可能对离线学习产生严重不利影响。虽然 MuJoCo 环境中的连续状态空间和多维动作空间是外推误差的 contributing factors,但与真实世界环境相比,这些任务的规模较小。因此,即使有足够的数据收集,由于对灾难性遗忘的担忧,外推误差仍可能发生。因此,在真实世界中使用的离线强化学习算法将需要在不需要大量数据的情况下提供实际保证。

4 Batch-Constrained Reinforcement Learning

Current off-policy deep reinforcement learning algorithms fail to address extrapolation error by selecting actions with respect to a learned value estimate, without consideration of the accuracy of the estimate. As a result, certain out-of-distribution actions can be erroneously extrapolated to higher values. However, the value of an off-policy agent can be accurately evaluated in regions where data is available. We propose a conceptually simple idea: to avoid extrapolation error a policy should induce a similar state-action visitation to the batch. We denote policies which satisfy this notion as batch-constrained. To optimize off-policy learning for a given batch, batch-constrained policies are trained to select actions with respect to three objectives:

  1. (1)

    Minimize the distance of selected actions to the data in the batch.

  2. (2)

    Lead to states where familiar data can be observed.

  3. (3)

    Maximize the value function.


We note the importance of objective (1) above the others, as the value function and estimates of future states may be arbitrarily poor without access to the corresponding transitions. That is, we cannot correctly estimate (2) and (3) unless (1) is sufficiently satisfied. As a result, we propose optimizing the value function, along with some measure of future certainty, with a constraint limiting the distance of selected actions to the batch. This is achieved in our deep reinforcement learning algorithm through a state-conditioned generative model, to produce likely actions under the batch. This generative model is combined with a network which aims to optimally perturb the generated actions in a small range, along with a Q-network, used to select the highest valued action. Finally, we train a pair of Q-networks, and take the minimum of their estimates during the value update. This update penalizes states which are unfamiliar, and pushes the policy to select actions which lead to certain data.
我们注意到目标(1)的重要性高于其他目标,因为价值函数和未来状态的估计可能会非常糟糕,如果没有对应转换的话。也就是说,除非目标(1)得到充分满足,否则我们无法正确估计(2)和(3)。因此,我们建议优化价值函数,以及一些未来确定性的度量,同时通过限制所选动作与批处理之间的距离来实现。在我们的深度强化学习算法中,通过一个状态条件的生成模型来实现这一点,以在批处理下生成可能的动作。这个生成模型与一个网络结合在一起,旨在在一个小范围内最优地扰动生成的动作,同时还有一个 Q 网络,用于选择价值最高的动作。最后,我们训练一对 Q 网络,并在价值更新过程中取它们估计的最小值。这种更新惩罚了不熟悉的状态,并推动策略选择导致确定数据的动作。

We begin by analyzing the theoretical properties of batch-constrained policies in a finite MDP setting, where we are able to quantify extrapolation error precisely. We then introduce our deep reinforcement learning algorithm in detail, Batch-Constrained deep Q-learning (BCQ) by drawing inspiration from the tabular analogue.
我们首先分析了在有限 MDP 环境中批量约束策略的理论特性,我们能够精确量化外推误差。然后,我们通过从表格模拟中汲取灵感,详细介绍了我们的深度强化学习算法,批量约束深度 Q 学习(BCQ)。

4.1 Addressing Extrapolation Error in Finite MDPs
在有限 MDP 中解决外推误差

In the finite MDP setting, extrapolation error can be described by the bias from the mismatch between the transitions contained in the buffer and the true MDP. We find that by inducing a data distribution that is contained entirely within the batch, batch-constrained policies can eliminate extrapolation error entirely for deterministic MDPs. In addition, we show that the batch-constrained variant of Q-learning converges to the optimal policy under the same conditions as the standard form of Q-learning. Moreover, we prove that for a deterministic MDP, batch-constrained Q-learning is guaranteed to match, or outperform, the behavioral policy when starting from any state contained in the batch. All of the proofs for this section can be found in the Supplementary Material.
在有限 MDP 设置中,外推误差可以被描述为缓冲区中包含的转换与真实 MDP 之间的不匹配所导致的偏差。我们发现,通过引入一个完全包含在批处理中的数据分布,批处理约束策略可以完全消除确定性 MDP 的外推误差。此外,我们证明了 Q-learning 的批处理约束变体在与标准形式的 Q-learning 相同的条件下收敛到最优策略。此外,我们证明了对于确定性 MDP,批处理约束 Q-learning 在从批处理中的任何状态开始时保证与或优于行为策略相匹配。本节的所有证明都可以在补充材料中找到。

A value estimate Q𝑄Q can be learned using an experience replay buffer \mathcal{B}. This involves sampling transition tuples (s,a,r,s)𝑠𝑎𝑟superscript𝑠(s,a,r,s^{\prime}) with uniform probability, and applying the temporal difference update (Sutton, 1988; Watkins, 1989):
可以使用经验重放缓冲区学习价值估计。这涉及使用均匀概率对转换元组进行抽样,并应用时间差异更新(Sutton, 1988; Watkins, 1989):

Q(s,a)(1α)Q(s,a)+α(r+γQ(s,π(s))).𝑄𝑠𝑎1𝛼𝑄𝑠𝑎𝛼𝑟𝛾𝑄superscript𝑠𝜋superscript𝑠Q(s,a)\leftarrow(1-\alpha)Q(s,a)+\alpha\left(r+\gamma Q(s^{\prime},\pi(s^{\prime}))\right). (6)

If π(s)=argmaxaQ(s,a)𝜋superscript𝑠subscriptargmaxsuperscript𝑎𝑄superscript𝑠superscript𝑎\pi(s^{\prime})=\operatorname*{argmax}_{a^{\prime}}Q(s^{\prime},a^{\prime}), this is known as Q-learning. Assuming a non-zero probability of sampling any possible transition tuple from the buffer and infinite updates, Q-learning converges to the optimal value function.
如果是这样,这就被称为 Q 学习。假设从缓冲区中采样任何可能的转移元组的概率不为零,并且更新无限次,Q 学习将收敛到最优值函数。

We begin by showing that the value function Q𝑄Q learned with the batch \mathcal{B} corresponds to the value function for an alternate MDP Msubscript𝑀M_{\mathcal{B}}. From the true MDP M𝑀M and initial values Q(s,a)𝑄𝑠𝑎Q(s,a), we define the new MDP Msubscript𝑀M_{\mathcal{B}} with the same action and state space as M𝑀M, along with an additional terminal state sinitsubscript𝑠inits_{\text{init}}. Msubscript𝑀M_{\mathcal{B}} has transition probabilities p(s|s,a)=N(s,a,s)s~N(s,a,s~)subscript𝑝conditionalsuperscript𝑠𝑠𝑎𝑁𝑠𝑎superscript𝑠subscript~𝑠𝑁𝑠𝑎~𝑠p_{\mathcal{B}}(s^{\prime}|s,a)=\frac{N(s,a,s^{\prime})}{\sum_{\tilde{s}}N(s,a,\tilde{s})}, where N(s,a,s)𝑁𝑠𝑎superscript𝑠N(s,a,s^{\prime}) is the number of times the tuple (s,a,s)𝑠𝑎superscript𝑠(s,a,s^{\prime}) is observed in \mathcal{B}. If s~N(s,a,s~)=0subscript~𝑠𝑁𝑠𝑎~𝑠0\sum_{\tilde{s}}N(s,a,\tilde{s})=0, then p(sinit|s,a)=1subscript𝑝conditionalsubscript𝑠init𝑠𝑎1p_{\mathcal{B}}(s_{\text{init}}|s,a)=1, where r(s,a,sinit)𝑟𝑠𝑎subscript𝑠initr(s,a,s_{\text{init}}) is set to the initialized value of Q(s,a)𝑄𝑠𝑎Q(s,a).
我们首先展示,使用批处理学习的值函数 Q𝑄Q 对应于备用 MDP Msubscript𝑀M_{\mathcal{B}} 的值函数。从真实的 MDP M𝑀M 和初始值 Q(s,a)𝑄𝑠𝑎Q(s,a) 开始,我们定义具有与 M𝑀M 相同动作和状态空间的新 MDP Msubscript𝑀M_{\mathcal{B}} ,以及一个额外的终止状态 sinitsubscript𝑠inits_{\text{init}}Msubscript𝑀M_{\mathcal{B}} 具有转移概率 p(s|s,a)=N(s,a,s)s~N(s,a,s~)subscript𝑝conditionalsuperscript𝑠𝑠𝑎𝑁𝑠𝑎superscript𝑠subscript~𝑠𝑁𝑠𝑎~𝑠p_{\mathcal{B}}(s^{\prime}|s,a)=\frac{N(s,a,s^{\prime})}{\sum_{\tilde{s}}N(s,a,\tilde{s})} ,其中 N(s,a,s)𝑁𝑠𝑎superscript𝑠N(s,a,s^{\prime}) 是元组 (s,a,s)𝑠𝑎superscript𝑠(s,a,s^{\prime})\mathcal{B} 中观察到的次数。如果 s~N(s,a,s~)=0subscript~𝑠𝑁𝑠𝑎~𝑠0\sum_{\tilde{s}}N(s,a,\tilde{s})=0 ,则 p(sinit|s,a)=1subscript𝑝conditionalsubscript𝑠init𝑠𝑎1p_{\mathcal{B}}(s_{\text{init}}|s,a)=1 ,其中 r(s,a,sinit)𝑟𝑠𝑎subscript𝑠initr(s,a,s_{\text{init}}) 设置为 Q(s,a)𝑄𝑠𝑎Q(s,a) 的初始化值。

Theorem 1. Performing Q-learning by sampling from a batch \mathcal{B} converges to the optimal value function under the MDP Msubscript𝑀M_{\mathcal{B}}.
定理 1. 通过从批次 \mathcal{B} 中抽样执行 Q-learning 收敛到 MDP Msubscript𝑀M_{\mathcal{B}} 下的最优值函数。

We define ϵMDPsubscriptitalic-ϵMDP\epsilon_{\text{MDP}} as the tabular extrapolation error, which accounts for the discrepancy between the value function Qπsubscriptsuperscript𝑄𝜋Q^{\pi}_{\mathcal{B}} computed with the batch \mathcal{B} and the value function Qπsuperscript𝑄𝜋Q^{\pi} computed with the true MDP M𝑀M:
我们将 ϵMDPsubscriptitalic-ϵMDP\epsilon_{\text{MDP}} 定义为表格外推误差,它考虑了使用批处理 \mathcal{B} 计算的值函数 Qπsubscriptsuperscript𝑄𝜋Q^{\pi}_{\mathcal{B}} 与使用真实 MDP M𝑀M 计算的值函数 Qπsuperscript𝑄𝜋Q^{\pi} 之间的差异:

ϵMDP(s,a)=Qπ(s,a)Qπ(s,a).subscriptitalic-ϵMDP𝑠𝑎superscript𝑄𝜋𝑠𝑎subscriptsuperscript𝑄𝜋𝑠𝑎\epsilon_{\text{MDP}}(s,a)=Q^{\pi}(s,a)-Q^{\pi}_{\mathcal{B}}(s,a). (7)

For any policy π𝜋\pi, the exact form of ϵMDP(s,a)subscriptitalic-ϵMDP𝑠𝑎\epsilon_{\text{MDP}}(s,a) can be computed through a Bellman-like equation:
对于任何政策 π𝜋\pi ,可以通过类似贝尔曼方程的方式计算出 ϵMDP(s,a)subscriptitalic-ϵMDP𝑠𝑎\epsilon_{\text{MDP}}(s,a) 的确切形式:

ϵMDP(s,a)=s(pM(s|s,a)p(s|s,a))(r(s,a,s)+γaπ(a|s)Qπ(s,a))+pM(s|s,a)γaπ(a|s)ϵMDP(s,a).subscriptitalic-ϵMDP𝑠𝑎subscriptsuperscript𝑠subscript𝑝𝑀conditionalsuperscript𝑠𝑠𝑎subscript𝑝conditionalsuperscript𝑠𝑠𝑎𝑟𝑠𝑎superscript𝑠𝛾subscriptsuperscript𝑎𝜋conditionalsuperscript𝑎superscript𝑠subscriptsuperscript𝑄𝜋superscript𝑠superscript𝑎subscript𝑝𝑀conditionalsuperscript𝑠𝑠𝑎𝛾subscriptsuperscript𝑎𝜋conditionalsuperscript𝑎superscript𝑠subscriptitalic-ϵMDPsuperscript𝑠superscript𝑎\begin{split}\epsilon_{\text{MDP}}(s,a)~{}&=\sum_{s^{\prime}}\left(p_{M}(s^{\prime}|s,a)-p_{\mathcal{B}}(s^{\prime}|s,a)\right)\\ &\left(r(s,a,s^{\prime})+\gamma\sum_{a^{\prime}}\pi(a^{\prime}|s^{\prime})Q^{\pi}_{\mathcal{B}}(s^{\prime},a^{\prime})\right)\\ &+p_{M}(s^{\prime}|s,a)\gamma\sum_{a^{\prime}}\pi(a^{\prime}|s^{\prime})\epsilon_{\text{MDP}}(s^{\prime},a^{\prime}).\end{split} (8)

This means extrapolation error is a function of divergence in the transition distributions, weighted by value, along with the error at succeeding states. If the policy is chosen carefully, the error between value functions can be minimized by visiting regions where the transition distributions are similar. For simplicity, we denote

ϵMDPπ=sμπ(s)aπ(a|s)|ϵMDP(s,a)|.subscriptsuperscriptitalic-ϵ𝜋MDPsubscript𝑠subscript𝜇𝜋𝑠subscript𝑎𝜋conditional𝑎𝑠subscriptitalic-ϵMDP𝑠𝑎\epsilon^{\pi}_{\text{MDP}}=\sum_{s}\mu_{\pi}(s)\sum_{a}\pi(a|s)|\epsilon_{\text{MDP}}(s,a)|. (9)

To evaluate a policy π𝜋\pi exactly at relevant state-action pairs, only ϵMDPπ=0subscriptsuperscriptitalic-ϵ𝜋MDP0\epsilon^{\pi}_{\text{MDP}}=0 is required. We can then determine the condition required to evaluate the exact expected return of a policy without extrapolation error.
要在相关的状态-动作对上精确评估一项政策,只需要 ϵMDPπ=0subscriptsuperscriptitalic-ϵ𝜋MDP0\epsilon^{\pi}_{\text{MDP}}=0 。然后我们可以确定评估一项政策的精确预期回报所需的条件,而不会出现外推误差。

Lemma 1. For all reward functions, ϵMDPπ=0subscriptsuperscriptitalic-ϵ𝜋MDP0\epsilon^{\pi}_{\text{MDP}}=0 if and only if p(s|s,a)=pM(s|s,a)subscript𝑝conditionalsuperscript𝑠𝑠𝑎subscript𝑝𝑀conditionalsuperscript𝑠𝑠𝑎p_{\mathcal{B}}(s^{\prime}|s,a)=p_{M}(s^{\prime}|s,a) for all s𝒮superscript𝑠𝒮s^{\prime}\in\mathcal{S} and (s,a)𝑠𝑎(s,a) such that μπ(s)>0subscript𝜇𝜋𝑠0\mu_{\pi}(s)>0 and π(a|s)>0𝜋conditional𝑎𝑠0\pi(a|s)>0.
引理 1. 对于所有的奖励函数, ϵMDPπ=0subscriptsuperscriptitalic-ϵ𝜋MDP0\epsilon^{\pi}_{\text{MDP}}=0 当且仅当对于所有的 s𝒮superscript𝑠𝒮s^{\prime}\in\mathcal{S}(s,a)𝑠𝑎(s,a) ,满足 μπ(s)>0subscript𝜇𝜋𝑠0\mu_{\pi}(s)>0π(a|s)>0𝜋conditional𝑎𝑠0\pi(a|s)>0

Lemma 1 states that if Msubscript𝑀M_{\mathcal{B}} and M𝑀M exhibit the same transition probabilities in regions of relevance, the policy can be accurately evaluated. For a stochastic MDP this may require an infinite number of samples to converge to the true distribution, however, for a deterministic MDP this requires only a single transition. This means a policy which only traverses transitions contained in the batch, can be evaluated without error. More formally, we denote a policy πΠ𝜋subscriptΠ\pi\in\Pi_{\mathcal{B}} as batch-constrained if for all (s,a)𝑠𝑎(s,a) where μπ(s)>0subscript𝜇𝜋𝑠0\mu_{\pi}(s)>0 and π(a|s)>0𝜋conditional𝑎𝑠0\pi(a|s)>0 then (s,a)𝑠𝑎(s,a)\in\mathcal{B}. Additionally, we define a batch \mathcal{B} as coherent if for all (s,a,s)𝑠𝑎superscript𝑠(s,a,s^{\prime})\in\mathcal{B} then ssuperscript𝑠s^{\prime}\in\mathcal{B} unless ssuperscript𝑠s^{\prime} is a terminal state. This condition is trivially satisfied if the data is collected in trajectories, or if all possible states are contained in the batch. With a coherent batch, we can guarantee the existence of a batch-constrained policy.
引理 1 指出,如果 Msubscript𝑀M_{\mathcal{B}}M𝑀M 在相关区域展示相同的转移概率,那么可以准确评估策略。对于随机 MDP,这可能需要无限数量的样本才能收敛到真实分布,然而对于确定性 MDP,只需要一个转移。这意味着只遍历批次中包含的转移的策略可以无误地进行评估。更正式地,我们将一个策略 πΠ𝜋subscriptΠ\pi\in\Pi_{\mathcal{B}} 称为批次受限的,如果对于所有 (s,a)𝑠𝑎(s,a) ,其中 μπ(s)>0subscript𝜇𝜋𝑠0\mu_{\pi}(s)>0π(a|s)>0𝜋conditional𝑎𝑠0\pi(a|s)>0 ,则 (s,a)𝑠𝑎(s,a)\in\mathcal{B} 。此外,我们定义一个批次 \mathcal{B} 为一致的,如果对于所有 (s,a,s)𝑠𝑎superscript𝑠(s,a,s^{\prime})\in\mathcal{B} ,则 ssuperscript𝑠s^{\prime}\in\mathcal{B} ,除非 ssuperscript𝑠s^{\prime} 是一个终止状态。如果数据是按轨迹收集的,或者所有可能的状态都包含在批次中,则这个条件显然是满足的。有了一致的批次,我们可以保证存在一个受批次限制的策略的存在。

Theorem 2. For a deterministic MDP and all reward functions, ϵMDPπ=0subscriptsuperscriptitalic-ϵ𝜋MDP0\epsilon^{\pi}_{\text{MDP}}=0 if and only if the policy π𝜋\pi is batch-constrained. Furthermore, if \mathcal{B} is coherent, then such a policy must exist if the start state s0subscript𝑠0s_{0}\in\mathcal{B}.
定理 2. 对于一个确定性 MDP 和所有奖励函数, ϵMDPπ=0subscriptsuperscriptitalic-ϵ𝜋MDP0\epsilon^{\pi}_{\text{MDP}}=0 当且仅当策略 π𝜋\pi 是批次受限的。此外,如果 \mathcal{B} 是连贯的,那么这样的策略必须存在,如果起始状态 s0subscript𝑠0s_{0}\in\mathcal{B}

Batch-constrained policies can be used in conjunction with Q-learning to form batch-constrained Q-learning (BCQL), which follows the standard tabular Q-learning update while constraining the possible actions with respect to the batch:
批量约束策略可以与 Q 学习结合使用,形成批量约束 Q 学习(BCQL),其遵循标准的表格 Q 学习更新,同时限制了与批量相关的可能动作:

Q(s,a)(1α)Q(s,a)+α(r+γmaxas.t.(s,a)Q(s,a)).𝑄𝑠𝑎1𝛼𝑄𝑠𝑎𝛼𝑟𝛾subscriptsuperscript𝑎s.t.superscript𝑠superscript𝑎𝑄superscript𝑠superscript𝑎Q(s,a)\leftarrow(1-\alpha)Q(s,a)+\alpha(r+\gamma\max_{a^{\prime}\text{s.t.}(s^{\prime},a^{\prime})\in\mathcal{B}}Q(s^{\prime},a^{\prime})). (10)

BCQL converges under the same conditions as the standard form of Q-learning, noting the batch-constraint is nonrestrictive given infinite state-action visitation.
BCQL 在相同条件下收敛,注意到批量约束在状态-动作访问无限时是非限制性的。

Theorem 3. Given the Robbins-Monro stochastic convergence conditions on the learning rate α𝛼\alpha, and standard sampling requirements from the environment, BCQL converges to the optimal value function Qsuperscript𝑄Q^{*}.
定理 3. 在学习率 α𝛼\alpha

The more interesting property of BCQL is that for a deterministic MDP and any coherent batch \mathcal{B}, BCQL converges to the optimal batch-constrained policy πΠsuperscript𝜋subscriptΠ\pi^{*}\in\Pi_{\mathcal{B}} such that Qπ(s,a)Qπ(s,a)superscript𝑄superscript𝜋𝑠𝑎superscript𝑄𝜋𝑠𝑎Q^{\pi^{*}}(s,a)\geq Q^{\pi}(s,a) for all πΠ𝜋subscriptΠ\pi\in\Pi_{\mathcal{B}} and (s,a)𝑠𝑎(s,a)\in\mathcal{B}.
BCQL 的更有趣的特性是,对于一个确定性 MDP 和任何一致的批次 \mathcal{B}

Theorem 4. Given a deterministic MDP and coherent batch \mathcal{B}, along with the Robbins-Monro stochastic convergence conditions on the learning rate α𝛼\alpha and standard sampling requirements on the batch \mathcal{B}, BCQL converges to Qπ(s,a)subscriptsuperscript𝑄𝜋𝑠𝑎Q^{\pi}_{\mathcal{B}}(s,a) where π(s)=argmaxa s.t.(s,a)Qπ(s,a)superscript𝜋𝑠subscriptargmax𝑎 s.t.𝑠𝑎subscriptsuperscript𝑄𝜋𝑠𝑎\pi^{*}(s)=\operatorname*{argmax}_{a\text{ s.t.}(s,a)\in\mathcal{B}}Q^{\pi}_{\mathcal{B}}(s,a) is the optimal batch-constrained policy.
定理 4. 给定确定性 MDP 和一致的批次 \mathcal{B}

This means that BCQL is guaranteed to outperform any behavioral policy when starting from any state contained in the batch, effectively outperforming imitation learning. Unlike standard Q-learning, there is no condition on state-action visitation, other than coherency in the batch.
这意味着 BCQL 保证在从批处理中的任何状态开始时优于任何行为策略,有效地超越了模仿学习。与标准 Q 学习不同,除了批处理中的一致性外,对状态-动作访问没有条件。

4.2 Batch-Constrained Deep Reinforcement Learning
4.2 批次约束深度强化学习

We introduce our approach to off-policy batch reinforcement learning, Batch-Constrained deep Q-learning (BCQ). BCQ approaches the notion of batch-constrained through a generative model. For a given state, BCQ generates plausible candidate actions with high similarity to the batch, and then selects the highest valued action through a learned Q-network. Furthermore, we bias this value estimate to penalize rare, or unseen, states through a modification to Clipped Double Q-learning (Fujimoto et al., 2018). As a result, BCQ learns a policy with a similar state-action visitation to the data in the batch, as inspired by the theoretical benefits of its tabular counterpart.
我们介绍了我们的离线批量强化学习方法,批量约束深度 Q 学习(BCQ)。BCQ 通过生成模型来处理批量约束的概念。对于给定的状态,BCQ 生成与批量高度相似的可信候选动作,然后通过学习的 Q 网络选择价值最高的动作。此外,我们通过对 Clipped Double Q 学习(Fujimoto 等人,2018)进行修改,对罕见或未见状态进行惩罚以偏置这个价值估计。因此,BCQ 学习了一个与批量数据中的状态-动作访问类似的策略,受到其表格对应物理论优势的启发。

To maintain the notion of batch-constraint, we define a similarity metric by making the assumption that for a given state s𝑠s, the similarity between (s,a)𝑠𝑎(s,a) and the state-action pairs in the batch \mathcal{B} can be modelled using a learned state-conditioned marginal likelihood PG(a|s)superscriptsubscript𝑃𝐺conditional𝑎𝑠P_{\mathcal{B}}^{G}(a|s). In this case, it follows that the policy maximizing PG(a|s)superscriptsubscript𝑃𝐺conditional𝑎𝑠P_{\mathcal{B}}^{G}(a|s) would minimize the error induced by extrapolation from distant, or unseen, state-action pairs, by only selecting the most likely actions in the batch with respect to a given state. Given the difficulty of estimating PG(a|s)superscriptsubscript𝑃𝐺conditional𝑎𝑠P_{\mathcal{B}}^{G}(a|s) in high-dimensional continuous spaces, we instead train a parametric generative model of the batch Gω(s)subscript𝐺𝜔𝑠G_{\omega}(s), which we can sample actions from, as a reasonable approximation to argmaxaPG(a|s)subscriptargmax𝑎superscriptsubscript𝑃𝐺conditional𝑎𝑠\operatorname*{argmax}_{a}P_{\mathcal{B}}^{G}(a|s).
为了保持批次约束的概念,我们通过假设对于给定状态 s𝑠s ,定义了一个相似度度量,该度量可以用学习的状态条件边际似然 PG(a|s)superscriptsubscript𝑃𝐺conditional𝑎𝑠P_{\mathcal{B}}^{G}(a|s) 来建模 (s,a)𝑠𝑎(s,a) 和批次 \mathcal{B} 中的状态-动作对之间的相似度。在这种情况下,遵循这样一个策略,最大化 PG(a|s)superscriptsubscript𝑃𝐺conditional𝑎𝑠P_{\mathcal{B}}^{G}(a|s) 的策略将通过仅选择相对于给定状态在批次中最可能的动作来最小化由远距离或未见的状态-动作对引起的误差。鉴于在高维连续空间中估计 PG(a|s)superscriptsubscript𝑃𝐺conditional𝑎𝑠P_{\mathcal{B}}^{G}(a|s) 的困难,我们改为训练批次 Gω(s)subscript𝐺𝜔𝑠G_{\omega}(s) 的参数生成模型,我们可以从中采样动作,作为 argmaxaPG(a|s)subscriptargmax𝑎superscriptsubscript𝑃𝐺conditional𝑎𝑠\operatorname*{argmax}_{a}P_{\mathcal{B}}^{G}(a|s) 的合理近似。

For our generative model we use a conditional variational auto-encoder (VAE) (Kingma & Welling, 2013; Sohn et al., 2015), which models the distribution by transforming an underlying latent space111See the Supplementary Material for an introduction to VAEs.. The generative model Gωsubscript𝐺𝜔G_{\omega}, alongside the value function Qθsubscript𝑄𝜃Q_{\theta}, can be used as a policy by sampling n𝑛n actions from Gωsubscript𝐺𝜔G_{\omega} and selecting the highest valued action according to the value estimate Qθsubscript𝑄𝜃Q_{\theta}. To increase the diversity of seen actions, we introduce a perturbation model ξϕ(s,a,Φ)subscript𝜉italic-ϕ𝑠𝑎Φ\xi_{\phi}(s,a,\Phi), which outputs an adjustment to an action a𝑎a in the range [Φ,Φ]ΦΦ[-\Phi,\Phi]. This enables access to actions in a constrained region, without having to sample from the generative model a prohibitive number of times. This results in the policy π𝜋\pi:
对于我们的生成模型,我们使用了一个条件变分自动编码器(VAE)(Kingma&Welling,2013 年; Sohn 等,2015 年),它通过转换潜在的潜在空间来建模分布。生成模型,连同价值函数,可以作为策略使用,通过从中采样动作并根据价值估计选择价值最高的动作。为了增加所见动作的多样性,我们引入了一个扰动模型,它输出一个在范围内的动作调整。这使得可以访问受限区域内的动作,而无需从生成模型中采样过多次。这导致了策略:

π(s)=argmaxai+ξϕ(s,ai,Φ)Qθ(s,ai+ξϕ(s,ai,Φ)),{aiGω(s)}i=1n.𝜋𝑠subscriptargmaxsubscript𝑎𝑖subscript𝜉italic-ϕ𝑠subscript𝑎𝑖Φsubscript𝑄𝜃𝑠subscript𝑎𝑖subscript𝜉italic-ϕ𝑠subscript𝑎𝑖Φsuperscriptsubscriptsimilar-tosubscript𝑎𝑖subscript𝐺𝜔𝑠𝑖1𝑛\begin{split}\pi(s)=&\operatorname*{argmax}_{a_{i}+\xi_{\phi}(s,a_{i},\Phi)}Q_{\theta}(s,a_{i}+\xi_{\phi}(s,a_{i},\Phi)),\\ &\{a_{i}\sim G_{\omega}(s)\}_{i=1}^{n}.\end{split} (11)

The choice of n𝑛n and ΦΦ\Phi creates a trade-off between an imitation learning and reinforcement learning algorithm. If Φ=0Φ0\Phi=0, and the number of sampled actions n=1𝑛1n=1, then the policy resembles behavioral cloning and as ΦamaxaminΦsubscript𝑎maxsubscript𝑎min\Phi\rightarrow a_{\text{max}}-a_{\text{min}} and n𝑛n\rightarrow\infty, then the algorithm approaches Q-learning, as the policy begins to greedily maximize the value function over the entire action space.
选择 n𝑛nΦΦ\Phi 之间存在着模仿学习和强化学习算法之间的权衡。如果 Φ=0Φ0\Phi=0 ,以及采样动作的数量 n=1𝑛1n=1 ,那么策略类似于行为克隆,当 ΦamaxaminΦsubscript𝑎maxsubscript𝑎min\Phi\rightarrow a_{\text{max}}-a_{\text{min}}n𝑛n\rightarrow\infty 时,该算法接近 Q 学习,因为策略开始贪婪地最大化整个动作空间的值函数。

Algorithm 1 BCQ 算法 1 BCQ
  Input: Batch \mathcal{B}, horizon T𝑇T, target network update rate τ𝜏\tau, mini-batch size N𝑁N, max perturbation ΦΦ\Phi, number of sampled actions n𝑛n, minimum weighting λ𝜆\lambda.
输入:批次 \mathcal{B} ,时间跨度 T𝑇T ,目标网络更新率 τ𝜏\tau ,小批量大小 N𝑁N ,最大扰动 ΦΦ\Phi ,采样动作数量 n𝑛n ,最小加权 λ𝜆\lambda
  Initialize Q-networks Qθ1,Qθ2subscript𝑄subscript𝜃1subscript𝑄subscript𝜃2Q_{\theta_{1}},Q_{\theta_{2}}, perturbation network ξϕsubscript𝜉italic-ϕ\xi_{\phi}, and VAE Gω={Eω1,Dω2}subscript𝐺𝜔subscript𝐸subscript𝜔1subscript𝐷subscript𝜔2G_{\omega}=\{E_{\omega_{1}},D_{\omega_{2}}\}, with random parameters θ1subscript𝜃1\theta_{1}, θ2subscript𝜃2\theta_{2}, ϕitalic-ϕ\phi, ω𝜔\omega, and target networks Qθ1,Qθ2subscript𝑄subscriptsuperscript𝜃1subscript𝑄subscriptsuperscript𝜃2Q_{\theta^{\prime}_{1}},Q_{\theta^{\prime}_{2}}, ξϕsubscript𝜉superscriptitalic-ϕ\xi_{\phi^{\prime}} with θ1θ1,θ2θ2formulae-sequencesubscriptsuperscript𝜃1subscript𝜃1subscriptsuperscript𝜃2subscript𝜃2\theta^{\prime}_{1}\leftarrow\theta_{1},\theta^{\prime}_{2}\leftarrow\theta_{2}, ϕϕsuperscriptitalic-ϕitalic-ϕ\phi^{\prime}\leftarrow\phi.
使用随机参数 θ1subscript𝜃1\theta_{1}θ2subscript𝜃2\theta_{2}ϕitalic-ϕ\phiω𝜔\omega 初始化 Q 网络 Qθ1,Qθ2subscript𝑄subscript𝜃1subscript𝑄subscript𝜃2Q_{\theta_{1}},Q_{\theta_{2}} ,扰动网络 ξϕsubscript𝜉italic-ϕ\xi_{\phi} 和 VAE Gω={Eω1,Dω2}subscript𝐺𝜔subscript𝐸subscript𝜔1subscript𝐷subscript𝜔2G_{\omega}=\{E_{\omega_{1}},D_{\omega_{2}}\} ,以及目标网络 Qθ1,Qθ2subscript𝑄subscriptsuperscript𝜃1subscript𝑄subscriptsuperscript𝜃2Q_{\theta^{\prime}_{1}},Q_{\theta^{\prime}_{2}}ξϕsubscript𝜉superscriptitalic-ϕ\xi_{\phi^{\prime}} ,带有 θ1θ1,θ2θ2formulae-sequencesubscriptsuperscript𝜃1subscript𝜃1subscriptsuperscript𝜃2subscript𝜃2\theta^{\prime}_{1}\leftarrow\theta_{1},\theta^{\prime}_{2}\leftarrow\theta_{2}ϕϕsuperscriptitalic-ϕitalic-ϕ\phi^{\prime}\leftarrow\phi
  for t=1𝑡1t=1 to T𝑇T do
t=1𝑡1t=1T𝑇T 进行
     Sample mini-batch of N𝑁N transitions (s,a,r,s)𝑠𝑎𝑟superscript𝑠(s,a,r,s^{\prime}) from \mathcal{B}
\mathcal{B} 中抽取 N𝑁N 个转换的样本小批次 (s,a,r,s)𝑠𝑎𝑟superscript𝑠(s,a,r,s^{\prime})
     μ,σ=Eω1(s,a),a~=Dω2(s,z),z𝒩(μ,σ)formulae-sequence𝜇𝜎subscript𝐸subscript𝜔1𝑠𝑎formulae-sequence~𝑎subscript𝐷subscript𝜔2𝑠𝑧similar-to𝑧𝒩𝜇𝜎\mu,\sigma=E_{\omega_{1}}(s,a),\quad\tilde{a}=D_{\omega_{2}}(s,z),\quad z\sim\mathcal{N}(\mu,\sigma)
     Sample n𝑛n actions: {aiGω(s)}i=1nsuperscriptsubscriptsimilar-tosubscript𝑎𝑖subscript𝐺𝜔superscript𝑠𝑖1𝑛\{a_{i}\sim G_{\omega}(s^{\prime})\}_{i=1}^{n}
示例 n𝑛n 操作: {aiGω(s)}i=1nsuperscriptsubscriptsimilar-tosubscript𝑎𝑖subscript𝐺𝜔superscript𝑠𝑖1𝑛\{a_{i}\sim G_{\omega}(s^{\prime})\}_{i=1}^{n}
     Perturb each action: {ai=ai+ξϕ(s,ai,Φ)}i=1nsuperscriptsubscriptsubscript𝑎𝑖subscript𝑎𝑖subscript𝜉italic-ϕsuperscript𝑠subscript𝑎𝑖Φ𝑖1𝑛\{a_{i}=a_{i}+\xi_{\phi}(s^{\prime},a_{i},\Phi)\}_{i=1}^{n}
扰动每个动作: {ai=ai+ξϕ(s,ai,Φ)}i=1nsuperscriptsubscriptsubscript𝑎𝑖subscript𝑎𝑖subscript𝜉italic-ϕsuperscript𝑠subscript𝑎𝑖Φ𝑖1𝑛\{a_{i}=a_{i}+\xi_{\phi}(s^{\prime},a_{i},\Phi)\}_{i=1}^{n}
     Set value target y𝑦y (Eqn. 13)
设置目标值 y𝑦y (方程式 13)
     ϕargmaxϕQθ1(s,a+ξϕ(s,a,Φ)),aGω(s)formulae-sequenceitalic-ϕsubscriptargmaxitalic-ϕsubscript𝑄subscript𝜃1𝑠𝑎subscript𝜉italic-ϕ𝑠𝑎Φsimilar-to𝑎subscript𝐺𝜔𝑠\phi\leftarrow\operatorname*{argmax}_{\phi}\sum Q_{\theta_{1}}(s,a+\xi_{\phi}(s,a,\Phi)),a\sim G_{\omega}(s)
     Update target networks: θiτθ+(1τ)θisubscriptsuperscript𝜃𝑖𝜏𝜃1𝜏subscriptsuperscript𝜃𝑖\theta^{\prime}_{i}\leftarrow\tau\theta+(1-\tau)\theta^{\prime}_{i}
更新目标网络: θiτθ+(1τ)θisubscriptsuperscript𝜃𝑖𝜏𝜃1𝜏subscriptsuperscript𝜃𝑖\theta^{\prime}_{i}\leftarrow\tau\theta+(1-\tau)\theta^{\prime}_{i}
  end for 结束循环

The perturbation model ξϕsubscript𝜉italic-ϕ\xi_{\phi} can be trained to maximize Qθ(s,a)subscript𝑄𝜃𝑠𝑎Q_{\theta}(s,a) through the deterministic policy gradient algorithm (Silver et al., 2014) by sampling aGω(s)similar-to𝑎subscript𝐺𝜔𝑠a\sim G_{\omega}(s):
扰动模型 ξϕsubscript𝜉italic-ϕ\xi_{\phi} 可以通过采样 aGω(s)similar-to𝑎subscript𝐺𝜔𝑠a\sim G_{\omega}(s) ,通过确定性策略梯度算法(Silver 等人,2014 年)进行训练以最大化 Qθ(s,a)subscript𝑄𝜃𝑠𝑎Q_{\theta}(s,a)

ϕargmaxϕ(s,a)Qθ(s,a+ξϕ(s,a,Φ)).italic-ϕsubscriptargmaxitalic-ϕsubscript𝑠𝑎subscript𝑄𝜃𝑠𝑎subscript𝜉italic-ϕ𝑠𝑎Φ\phi\leftarrow\operatorname*{argmax}_{\phi}\sum_{(s,a)\in\mathcal{B}}Q_{\theta}(s,a+\xi_{\phi}(s,a,\Phi)). (12)

To penalize uncertainty over future states, we modify Clipped Double Q-learning (Fujimoto et al., 2018), which estimates the value by taking the minimum between two Q-networks {Qθ1,Qθ2}subscript𝑄subscript𝜃1subscript𝑄subscript𝜃2\{Q_{\theta_{1}},Q_{\theta_{2}}\}. Although originally used as a countermeasure to overestimation bias (Thrun & Schwartz, 1993; Van Hasselt, 2010), the minimum operator also penalizes high variance estimates in regions of uncertainty, and pushes the policy to favor actions which lead to states contained in the batch. In particular, we take a convex combination of the two values, with a higher weight on the minimum, to form a learning target which is used by both Q-networks:
为了惩罚对未来状态的不确定性,我们修改了 Clipped Double Q-learning(Fujimoto 等人,2018),该方法通过两个 Q 网络之间取最小值来估计价值。尽管最初被用作对过度估计偏差的对策(Thrun&Schwartz,1993;Van Hasselt,2010),最小操作符也惩罚不确定性区域中的高方差估计,并推动策略偏向导致批处理中包含的状态的动作。具体来说,我们采用两个值的凸组合,更重视最小值,形成一个学习目标,被两个 Q 网络使用:

r+γmaxai[λminj=1,2Qθj(s,ai)+(1λ)maxj=1,2Qθj(s,ai)]𝑟𝛾subscriptsubscript𝑎𝑖𝜆subscript𝑗12subscript𝑄subscriptsuperscript𝜃𝑗superscript𝑠subscript𝑎𝑖1𝜆subscript𝑗12subscript𝑄subscriptsuperscript𝜃𝑗superscript𝑠subscript𝑎𝑖r+\gamma\max_{a_{i}}\left[\lambda\min_{j=1,2}Q_{\theta^{\prime}_{j}}(s^{\prime},a_{i})+(1-\lambda)\max_{j=1,2}Q_{\theta^{\prime}_{j}}(s^{\prime},a_{i})\right] (13)

where aisubscript𝑎𝑖a_{i} corresponds to the perturbed actions, sampled from the generative model. If we set λ=1𝜆1\lambda=1, this update corresponds to Clipped Double Q-learning. We use this weighted minimum as the constrained updates produces less overestimation bias than a purely greedy policy update, and enables control over how heavily uncertainty at future time steps is penalized through the choice of λ𝜆\lambda.
其中 aisubscript𝑎𝑖a_{i} 对应于从生成模型中采样的扰动动作。如果我们设置 λ=1𝜆1\lambda=1 ,这个更新对应于 Clipped Double Q-learning。我们使用这个加权最小值作为受限更新,因为相比于纯贪婪策略更新,它产生更少的高估偏差,并且通过选择 λ𝜆\lambda 可以控制对未来时间步长的不确定性的惩罚程度。

This forms Batch-Constrained deep Q-learning (BCQ), which maintains four parametrized networks: a generative model Gω(s)subscript𝐺𝜔𝑠G_{\omega}(s), a perturbation model ξϕ(s,a)subscript𝜉italic-ϕ𝑠𝑎\xi_{\phi}(s,a), and two Q-networks Qθ1(s,a),Qθ2(s,a)subscript𝑄subscript𝜃1𝑠𝑎subscript𝑄subscript𝜃2𝑠𝑎Q_{\theta_{1}}(s,a),Q_{\theta_{2}}(s,a). We summarize BCQ in Algorithm 1. In the following section, we demonstrate BCQ results in stable value learning and a strong performance in the batch setting. Furthermore, we find that only a single choice of hyper-parameters is necessary for a wide range of tasks and environments.
这形成了批次约束深度 Q 学习(BCQ),它维护四个参数化网络:一个生成模型,一个扰动模型和两个 Q 网络。我们在算法 1 中总结了 BCQ。在接下来的部分中,我们展示了 BCQ 在稳定值学习和批处理设置中的强大表现。此外,我们发现对于广泛的任务和环境,只需要一个超参数选择。

5 Experiments 5 实验

Refer to caption
Refer to caption
(a) Final buffer performance
Refer to caption
(b) Concurrent performance
Refer to caption
(c) Imitation performance 模仿表演
Refer to caption
(d) Imperfect demonstrations performance
Figure 2: We evaluate BCQ and several baselines on the experiments from Section 3.1, as well as the imperfect demonstrations task. The shaded area represents half a standard deviation. The bold black line measures the average return of episodes contained in the batch. Only BCQ matches or outperforms the performance of the behavioral policy in all tasks.
图 2:我们在第 3.1 节的实验以及不完美演示任务中评估了 BCQ 和几个基线。阴影区域代表半个标准差。粗黑线测量了批次中包含的剧集的平均回报。只有 BCQ 在所有任务中与行为策略的表现相匹配或表现更好。

To evaluate the effectiveness of Batch-Constrained deep Q-learning (BCQ) in a high-dimensional setting, we focus on MuJoCo environments in OpenAI gym (Todorov et al., 2012; Brockman et al., 2016). For reproducibility, we make no modifications to the original environments or reward functions. We compare our method with DDPG (Lillicrap et al., 2015), DQN (Mnih et al., 2015) using an independently discretized action space, a feed-forward behavioral cloning method (BC), and a variant with a VAE (VAE-BC), using Gω(s)subscript𝐺𝜔𝑠G_{\omega}(s) from BCQ. Exact implementation and experimental details are provided in the Supplementary Material.
为了评估批量约束深度 Q 学习(BCQ)在高维环境中的有效性,我们专注于 OpenAI gym 中的 MuJoCo 环境(Todorov 等,2012 年;Brockman 等,2016 年)。为了可重现性,我们对原始环境或奖励函数不做任何修改。我们将我们的方法与 DDPG(Lillicrap 等,2015 年)、DQN(Mnih 等,2015 年)使用独立离散化的动作空间、前馈行为克隆方法(BC)以及带有 VAE 的变体(VAE-BC)进行比较,使用 BCQ 中的 Gω(s)subscript𝐺𝜔𝑠G_{\omega}(s) 。具体实现和实验细节请参阅附录材料。

Refer to caption
Refer to caption
(a) Final Buffer 最终缓冲区
(b) Concurrent 并发
(c) Imitation 模仿
Figure 3: We examine the value estimates of BCQ, along with DDPG and DQN on the experiments from Section 3.1 in the Hopper-v1 environment. Each individual trial is plotted, with the mean in bold. An estimate of the true value of BCQ, evaluated by Monte Carlo returns, is marked by a dotted line. Unlike the state of the art baselines, BCQ exhibits a highly stable value function in each task. Graphs for the other environments and imperfect demonstrations task can be found in the Supplementary Material.
图 3:我们在 Hopper-v1 环境中对 BCQ、DDPG 和 DQN 的价值估计进行了研究,这些研究是在第 3.1 节的实验中进行的。每个单独的试验都被绘制出来,均值用粗体表示。通过蒙特卡洛回报评估的 BCQ 真实价值估计用虚线标记。与现有技术基线不同,BCQ 在每个任务中展现出高度稳定的价值函数。其他环境和不完美演示任务的图表可以在补充材料中找到。

We evaluate each method following the three experiments defined in Section 3.1. In final buffer the off-policy agents learn from the final replay buffer gathered by training a DDPG agent over a million time steps. In concurrent the off-policy agents learn concurrently, with the same replay buffer, as the behavioral DDPG policy, and in imitation, the agents learn from a dataset collected by an expert policy. Additionally, to study the robustness of BCQ to noisy and multi-modal data, we include an imperfect demonstrations task, in which the agents are trained with a batch of 100k transitions collected by an expert policy, with two sources of noise. The behavioral policy selects actions randomly with probability and with high exploratory noise 𝒩(0,0.3)𝒩00.3\mathcal{N}(0,0.3) added to the remaining actions. The experimental results for these tasks are reported in Figure 2. Furthermore, the estimated values of BCQ, DDPG and DQN, and the true value of BCQ are displayed in Figure 3.
我们根据第 3.1 节中定义的三个实验评估每种方法。在最终缓冲区中,离策略代理从通过训练 DDPG 代理器获得的最终重放缓冲区中学习了一百万个时间步骤。在并发中,离策略代理同时学习,使用相同的重放缓冲区,作为行为 DDPG 策略,而在模仿中,代理从专家策略收集的数据集中学习。此外,为了研究 BCQ 对嘈杂和多模态数据的鲁棒性,我们包括一个不完美的演示任务,代理通过专家策略收集的 10 万个转换的批次进行训练,其中包含两种噪声来源。行为策略以概率

Our approach, BCQ, is the only algorithm which succeeds at all tasks, matching or outperforming the behavioral policy in each instance, and outperforming all other agents, besides in the imitation learning task where behavioral cloning unsurprisingly performs the best. These results demonstrate that our algorithm can be used as a single approach for both imitation learning and off-policy reinforcement learning, with a single set of fixed hyper-parameters. Furthermore, unlike the deep reinforcement learning algorithms, DDPG and DQN, BCQ exhibits a highly stable value function in the presence of off-policy samples, suggesting extrapolation error has been successfully mitigated through the batch-constraint. In the imperfect demonstrations task, we find that both deep reinforcement learning and imitation learning algorithms perform poorly. BCQ, however, is able to strongly outperform the noisy demonstrator, disentangling poor and expert actions. Furthermore, compared to current deep reinforcement learning algorithms, which can require millions of time steps (Duan et al., 2016; Henderson et al., 2017), BCQ attains a high performance in remarkably few iterations. This suggests our approach effectively leverages expert transitions, even in the presence of noise.
我们的方法 BCQ 是唯一成功完成所有任务的算法,在每个实例中与行为策略匹配或胜过它,并且在模仿学习任务中胜过所有其他代理,除了行为克隆当然表现最好。这些结果表明我们的算法可以作为单一方法用于模仿学习和离策略强化学习,使用一组固定的超参数。此外,与深度强化学习算法 DDPG 和 DQN 不同,BCQ 在离策略样本存在时表现出高度稳定的值函数,表明通过批次约束成功地减轻了外推误差。在不完美的演示任务中,我们发现深度强化学习和模仿学习算法表现不佳。然而,BCQ 能够强大地胜过嘈杂的演示者,区分出不良和专家行为。此外,与当前的深度强化学习算法相比,后者可能需要数百万个时间步(Duan 等,2016; Henderson 等,2017),BCQ 在极少的迭代中取得了高性能。 这表明我们的方法在存在噪音的情况下有效地利用专家过渡。

6 Related Work 相关工作

Batch Reinforcement Learning. While batch reinforcement learning algorithms have been shown to be convergent with non-parametric function approximators such as averagers (Gordon, 1995) and kernel methods (Ormoneit & Sen, 2002), they make no guarantees on the quality of the policy without infinite data. Other batch algorithms, such as fitted Q-iteration, have used other function approximators, including decision trees (Ernst et al., 2005) and neural networks (Riedmiller, 2005), but come without convergence guarantees. Unlike many previous approaches to off-policy policy evaluation (Peshkin & Shelton, 2002; Thomas et al., 2015; Liu et al., 2018), our work focuses on constraining the policy to a subset of policies which can be adequately evaluated, rather than the process of evaluation itself. Additionally, off-policy algorithms which rely on importance sampling (Precup et al., 2001; Jiang & Li, 2016; Munos et al., 2016) may not be applicable in a batch setting, requiring access to the action probabilities under the behavioral policy, and scale poorly to multi-dimensional action spaces. Reinforcement learning with a replay buffer (Lin, 1992) can be considered a form of batch reinforcement learning, and is a standard tool for off-policy deep reinforcement learning algorithms (Mnih et al., 2015). It has been observed that a large replay buffer can be detrimental to performance (de Bruin et al., 2015; Zhang & Sutton, 2017) and the diversity of states in the buffer is an important factor for performance (de Bruin et al., 2016). Isele & Cosgun (2018) observed the performance of an agent was strongest when the distribution of data in the replay buffer matched the test distribution. These results defend the notion that extrapolation error is an important factor in the performance off-policy reinforcement learning.
批量强化学习。尽管已经证明批量强化学习算法与非参数函数逼近器(如平均值器(Gordon,1995)和核方法(Ormoneit&Sen,2002))收敛,但在没有无限数据的情况下,它们不能保证策略的质量。其他批量算法,如拟合 Q-迭代,已经使用其他函数逼近器,包括决策树(Ernst 等,2005)和神经网络(Riedmiller,2005),但没有收敛保证。与以往许多关于离线策略评估的方法不同(Peshkin&Shelton,2002;Thomas 等,2015;Liu 等,2018),我们的工作侧重于将策略限制在可以充分评估的策略子集上,而不是评估过程本身。此外,依赖重要性抽样的离线算法(Precup 等,2001;Jiang&Li,2016;Munos 等,2016)可能在批量设置中不适用,需要访问行为策略下的动作概率,并且在多维动作空间中扩展性较差。 使用回放缓冲区的强化学习(Lin,1992)可以被视为一种批量强化学习的形式,并且是离策略深度强化学习算法的标准工具(Mnih 等,2015)。已经观察到,大型回放缓冲区可能对性能有害(de Bruin 等,2015;Zhang 和 Sutton,2017),缓冲区中状态的多样性是性能的重要因素(de Bruin 等,2016)。Isele 和 Cosgun(2018)观察到,当回放缓冲区中的数据分布与测试分布匹配时,代理的性能最强。这些结果支持了外推误差是离策略强化学习性能的重要因素的观点。

Imitation Learning. Imitation learning and its variants are well studied problems (Schaal, 1999; Argall et al., 2009; Hussein et al., 2017). Imitation has been combined with reinforcement learning, via learning from demonstrations methods (Kim et al., 2013; Piot et al., 2014; Chemali & Lazaric, 2015), with deep reinforcement learning extensions (Hester et al., 2017; Večerík et al., 2017), and modified policy gradient approaches (Ho et al., 2016; Sun et al., 2017; Cheng et al., 2018; Sun et al., 2018). While effective, these interactive methods are inadequate for batch reinforcement learning as they require either an explicit distinction between expert and non-expert data, further on-policy data collection or access to an oracle. Research in imitation, and inverse reinforcement learning, with robustness to noise is an emerging area (Evans, 2016; Nair et al., 2018), but relies on some form of expert data. Gao et al. (2018) introduced an imitation learning algorithm which learned from imperfect demonstrations, by favoring seen actions, but is limited to discrete actions. Our work also connects to residual policy learning (Johannink et al., 2018; Silver et al., 2018), where the initial policy is the generative model, rather than an expert or feedback controller.
模仿学习。模仿学习及其变体是研究较多的问题(Schaal, 1999; Argall 等,2009; Hussein 等,2017)。模仿已经与强化学习相结合,通过从示范中学习的方法(Kim 等,2013; Piot 等,2014; Chemali & Lazaric,2015),以及深度强化学习扩展(Hester 等,2017; Večerík 等,2017),以及修改的策略梯度方法(Ho 等,2016; Sun 等,2017; Cheng 等,2018; Sun 等,2018)。尽管有效,这些交互式方法对于批量强化学习来说是不足的,因为它们要求明确区分专家和非专家数据,进一步进行在线数据收集或访问预言者。模仿学习和逆向强化学习的研究,以及对噪声的鲁棒性是一个新兴领域(Evans,2016; Nair 等,2018),但依赖于某种形式的专家数据。Gao 等(2018)提出了一种模仿学习算法,该算法从不完美的示范中学习,通过偏爱已见行动,但仅限于离散行动。我们的工作也与残差策略学习相关(Johannink 等,2018; Silver 等)。在这种情况下,初始政策是生成模型,而不是专家或反馈控制器。

Uncertainty in Reinforcement Learning. Uncertainty estimates in deep reinforcement learning have generally been used to encourage exploration (Dearden et al., 1998; Strehl & Littman, 2008; O’Donoghue et al., 2018; Azizzadenesheli et al., 2018). Other methods have examined approximating the Bayesian posterior of the value function (Osband et al., 2016, 2018; Touati et al., 2018), again using the variance to encourage exploration to unseen regions of the state space. In model-based reinforcement learning, uncertainty has been used for exploration, but also for the opposite effect–to push the policy towards regions of certainty in the model. This is used to combat the well-known problems with compounding model errors, and is present in policy search methods (Deisenroth & Rasmussen, 2011; Gal et al., 2016; Higuera et al., 2018; Xu et al., 2018), or combined with trajectory optimization (Chua et al., 2018) or value-based methods (Buckman et al., 2018). Our work connects to policy methods with conservative updates (Kakade & Langford, 2002), such as trust region (Schulman et al., 2015; Achiam et al., 2017; Pham et al., 2018) and information-theoretic methods (Peters & Mülling, 2010; Van Hoof et al., 2017), which aim to keep the updated policy similar to the previous policy. These methods avoid explicit uncertainty estimates, and rather force policy updates into a constrained range before collecting new data, limiting errors introduced by large changes in the policy. Similarly, our approach can be thought of as an off-policy variant, where the policy aims to be kept close, in output space, to any combination of the previous policies which performed data collection.
强化学习中的不确定性。深度强化学习中的不确定性估计通常用于鼓励探索。其他方法则试图近似值函数的贝叶斯后验,再次利用方差来鼓励探索未知状态空间的区域。在基于模型的强化学习中,不确定性被用于探索,同时也用于相反的效果——将策略推向模型中确定性的区域。这用于解决与模型错误叠加相关的众所周知的问题,并出现在策略搜索方法中,或与轨迹优化或基于值的方法相结合。我们的工作与具有保守更新的策略方法相关。2018 年)和信息论方法(Peters&Mülling,2010;Van Hoof 等,2017 年),旨在使更新的策略与先前的策略相似。这些方法避免明确的不确定性估计,而是在收集新数据之前将策略更新强制限制在一定范围内,限制由策略大幅变化引入的错误。类似地,我们的方法可以被视为一种离线策略变体,其中策略旨在在输出空间中保持接近于执行数据收集的任何组合的先前策略。

7 Conclusion 结论

In this work, we demonstrate a critical problem in off-policy reinforcement learning with finite data, where the value target introduces error by including an estimate of unseen state-action pairs. This phenomenon, which we denote extrapolation error, has important implications for off-policy and batch reinforcement learning, as it is generally implausible to have complete state-action coverage in any practical setting. We present batch-constrained reinforcement learning–acting close to on-policy with respect to the available data, as an answer to extrapolation error. When extended to a deep reinforcement learning setting, our algorithm, Batch-Constrained deep Q-learning (BCQ), is the first continuous control algorithm capable of learning from arbitrary batch data, without exploration. Due to the importance of batch reinforcement learning for practical applications, we believe BCQ will be a strong foothold for future algorithms to build on, while furthering our understanding of the systematic risks in Q-learning (Thrun & Schwartz, 1993; Lu et al., 2018).
在这项工作中,我们展示了离线策略强化学习中有限数据的一个关键问题,即值目标通过包含对未见状态-动作对的估计而引入错误。我们将这种现象称为外推误差,对离线和批量强化学习具有重要意义,因为在任何实际情况下完全覆盖状态-动作是不切实际的。我们提出了批量约束强化学习——在可用数据方面接近于在线策略,作为对外推误差的解决方案。当扩展到深度强化学习环境时,我们的算法,批量约束深度 Q 学习(BCQ),是第一个能够从任意批量数据中学习而无需探索的连续控制算法。由于批量强化学习对实际应用的重要性,我们相信 BCQ 将成为未来算法发展的一个坚实基础,同时深化我们对 Q 学习中系统风险的理解(Thrun & Schwartz,1993;Lu 等,2018)。

References 参考资料

  • Achiam et al. (2017) Achiam 等人(2017) Achiam, J., Held, D., Tamar, A., and Abbeel, P. Constrained policy optimization. In International Conference on Machine Learning, pp.  22–31, 2017.
    Achiam, J., Held, D., Tamar, A., 和 Abbeel, P. 受限策略优化。在机器学习国际会议上,页码 22-31,2017 年。
  • Argall et al. (2009) Argall 等人(2009) Argall, B. D., Chernova, S., Veloso, M., and Browning, B. A survey of robot learning from demonstration. Robotics and Autonomous Systems, 57(5):469–483, 2009.
    Argall, B. D., Chernova, S., Veloso, M., 和 Browning, B. 机器人示教学习调查。《机器人与自主系统》,57(5):469–483,2009。
  • Azizzadenesheli et al. (2018)
    Azizzadenesheli 等人(2018)
    Azizzadenesheli, K., Brunskill, E., and Anandkumar, A. Efficient exploration through bayesian deep q-networks. arXiv preprint arXiv:1802.04412, 2018.
    Azizzadenesheli, K., Brunskill, E., 和 Anandkumar, A. 高效探索通过贝叶斯深度 Q 网络. arXiv 预印本 arXiv:1802.04412, 2018.
  • Bertsekas & Tsitsiklis (1996)
    Bertsekas & Tsitsiklis(1996)
    Bertsekas, D. P. and Tsitsiklis, J. N. Neuro-Dynamic Programming. Athena scientific Belmont, MA, 1996.
    Bertsekas, D. P. 和 Tsitsiklis, J. N.《神经动态规划》。Athena scientific Belmont, MA, 1996.
  • Brockman et al. (2016) Brockman 等人(2016) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., and Zaremba, W. Openai gym, 2016.
    Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., 和 Zaremba, W. OpenAI Gym, 2016.
  • Buckman et al. (2018) Buckman 等人(2018) Buckman, J., Hafner, D., Tucker, G., Brevdo, E., and Lee, H. Sample-efficient reinforcement learning with stochastic ensemble value expansion. In Advances in Neural Information Processing Systems, pp. 8234–8244, 2018.
    Buckman, J., Hafner, D., Tucker, G., Brevdo, E., 和 Lee, H. 随机集成值扩展的样本高效强化学习。在神经信息处理系统的进展中,第 8234-8244 页,2018 年。
  • Chemali & Lazaric (2015) Chemali, J. and Lazaric, A. Direct policy iteration with demonstrations. In Proceedings of the Twenty-Fourth International Joint Conference on Artificial Intelligence, 2015.
    Chemali, J.和 Lazaric, A.直接政策迭代与演示。在 2015 年第二十四届国际人工智能联合会议论文集中。
  • Cheng et al. (2018) 程等人(2018) Cheng, C.-A., Yan, X., Wagener, N., and Boots, B. Fast policy learning through imitation and reinforcement. arXiv preprint arXiv:1805.10413, 2018.
    Cheng, C.-A., Yan, X., Wagener, N., 和 Boots, B. 通过模仿和强化快速政策学习. arXiv 预印本 arXiv:1805.10413, 2018.
  • Chua et al. (2018) 蔡等人(2018) Chua, K., Calandra, R., McAllister, R., and Levine, S. Deep reinforcement learning in a handful of trials using probabilistic dynamics models. In Advances in Neural Information Processing Systems 31, pp. 4759–4770, 2018.
    Chua, K., Calandra, R., McAllister, R., 和 Levine, S. 利用概率动力学模型在少数试验中进行深度强化学习。在 Advances in Neural Information Processing Systems 31, pp. 4759–4770, 2018.
  • Dayan & Watkins (1992)
    Dayan, P. and Watkins, C. J. C. H. Q-learning. Machine learning, 8(3):279–292, 1992.
    Dayan, P. 和 Watkins, C. J. C. H. Q-learning. 机器学习, 8(3):279–292, 1992.
  • de Bruin et al. (2015) de Bruin 等人(2015) de Bruin, T., Kober, J., Tuyls, K., and Babuška, R. The importance of experience replay database composition in deep reinforcement learning. In Deep Reinforcement Learning Workshop, NIPS, 2015.
    de Bruin, T., Kober, J., Tuyls, K., 和 Babuška, R. 经验重放数据库构成在深度强化学习中的重要性。在 NIPS 的深度强化学习研讨会上,2015 年。
  • de Bruin et al. (2016) de Bruin 等人(2016) de Bruin, T., Kober, J., Tuyls, K., and Babuška, R. Improved deep reinforcement learning for robotics through distribution-based experience retention. In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp.  3947–3952. IEEE, 2016.
    de Bruin, T., Kober, J., Tuyls, K., 和 Babuška, R. 通过基于分布的经验保留改进机器人的深度强化学习。在 IEEE/RSJ 国际智能机器人与系统大会(IROS)上发表,第 3947-3952 页。IEEE,2016。
  • Dearden et al. (1998) Dearden 等人(1998) Dearden, R., Friedman, N., and Russell, S. Bayesian q-learning. In AAAI/IAAI, pp.  761–768, 1998.
    Dearden, R., Friedman, N., 和 Russell, S. 贝叶斯 Q 学习. 在 AAAI/IAAI 会议上,页码 761-768,1998 年。
  • Deisenroth & Rasmussen (2011) Deisenroth, M. and Rasmussen, C. E. Pilco: A model-based and data-efficient approach to policy search. In International Conference on Machine Learning, pp. 465–472, 2011.
    Deisenroth, M. 和 Rasmussen, C. E. Pilco: 一种基于模型且数据高效的策略搜索方法。在 2011 年机器学习国际会议上,第 465-472 页。
  • Duan et al. (2016) 段等人(2016) Duan, Y., Chen, X., Houthooft, R., Schulman, J., and Abbeel, P. Benchmarking deep reinforcement learning for continuous control. In International Conference on Machine Learning, pp. 1329–1338, 2016.
    段宇,陈旭,豪特夫特,舒尔曼和阿贝尔。连续控制的深度强化学习基准测试。在机器学习国际会议上,第 1329-1338 页,2016 年。
  • Ernst et al. (2005) 恩斯特等人(2005 年) Ernst, D., Geurts, P., and Wehenkel, L. Tree-based batch mode reinforcement learning. Journal of Machine Learning Research, 6(Apr):503–556, 2005.
    Ernst, D., Geurts, P., 和 Wehenkel, L. 基于树的批处理模式强化学习。机器学习研究杂志, 6(Apr):503–556, 2005.
  • Evans (2016) Evans, O. Learning the preferences of ignorant, inconsistent agents. In AAAI, pp.  323–329, 2016.
    Evans, O. 学习无知、不一致代理的偏好。在 AAAI,第 323-329 页,2016 年。
  • Fujimoto et al. (2018) 藤本等人(2018) Fujimoto, S., van Hoof, H., and Meger, D. Addressing function approximation error in actor-critic methods. In International Conference on Machine Learning, volume 80, pp.  1587–1596. PMLR, 2018.
    Fujimoto, S., van Hoof, H., 和 Meger, D. 解决演员-评论家方法中的函数逼近误差。在机器学习国际会议上,第 80 卷,第 1587-1596 页。PMLR,2018。
  • Gal et al. (2016) Gal 等人(2016) Gal, Y., McAllister, R., and Rasmussen, C. E. Improving pilco with bayesian neural network dynamics models. In Data-Efficient Machine Learning workshop, International Conference on Machine Learning, 2016.
    Gal, Y., McAllister, R., 和 Rasmussen, C. E. 用贝叶斯神经网络动力学模型改进 PILCO。在 2016 年国际机器学习会议的数据高效机器学习研讨会上。
  • Gao et al. (2018) 高等(2018) Gao, Y., Lin, J., Yu, F., Levine, S., and Darrell, T. Reinforcement learning from imperfect demonstrations. arXiv preprint arXiv:1802.05313, 2018.
    Gao, Y., Lin, J., Yu, F., Levine, S., 和 Darrell, T. 不完美演示的强化学习。arXiv 预印本 arXiv:1802.05313, 2018.
  • Goodfellow et al. (2013)
    Goodfellow 等人(2013)
    Goodfellow, I. J., Mirza, M., Xiao, D., Courville, A., and Bengio, Y. An empirical investigation of catastrophic forgetting in gradient-based neural networks. arXiv preprint arXiv:1312.6211, 2013.
    Goodfellow, I. J., Mirza, M., Xiao, D., Courville, A., 和 Bengio, Y. 基于梯度的神经网络中灾难性遗忘的实证研究. arXiv 预印本 arXiv:1312.6211, 2013.
  • Gordon (1995) 戈登(1995) Gordon, G. J. Stable function approximation in dynamic programming. In Machine Learning Proceedings 1995, pp.  261–268. Elsevier, 1995.
    Gordon, G. J. 动态规划中的稳定函数逼近。在 1995 年机器学习会议论文集中,第 261-268 页。Elsevier,1995 年。
  • Henderson et al. (2017) Henderson 等人(2017) Henderson, P., Islam, R., Bachman, P., Pineau, J., Precup, D., and Meger, D. Deep Reinforcement Learning that Matters. arXiv preprint arXiv:1709.06560, 2017.
    Henderson, P., Islam, R., Bachman, P., Pineau, J., Precup, D., 和 Meger, D. 重要的深度强化学习. arXiv 预印本 arXiv:1709.06560, 2017.
  • Hester et al. (2017) Hester 等人(2017) Hester, T., Vecerik, M., Pietquin, O., Lanctot, M., Schaul, T., Piot, B., Horgan, D., Quan, J., Sendonaris, A., Dulac-Arnold, G., et al. Deep q-learning from demonstrations. arXiv preprint arXiv:1704.03732, 2017.
    Hester, T., Vecerik, M., Pietquin, O., Lanctot, M., Schaul, T., Piot, B., Horgan, D., Quan, J., Sendonaris, A., Dulac-Arnold, G., 等。基于演示的深度 Q 学习。arXiv 预印本 arXiv:1704.03732,2017。
  • Higuera et al. (2018) Higuera 等人(2018) Higuera, J. C. G., Meger, D., and Dudek, G. Synthesizing neural network controllers with probabilistic model based reinforcement learning. arXiv preprint arXiv:1803.02291, 2018.
    Higuera, J. C. G., Meger, D., 和 Dudek, G. 用基于概率模型的强化学习合成神经网络控制器。arXiv 预印本 arXiv:1803.02291, 2018.
  • Ho et al. (2016) 何等。 (2016 年) Ho, J., Gupta, J., and Ermon, S. Model-free imitation learning with policy optimization. In International Conference on Machine Learning, pp. 2760–2769, 2016.
    Ho, J., Gupta, J., 和 Ermon, S. 无模型模仿学习与策略优化。在机器学习国际会议上,第 2760-2769 页,2016 年。
  • Hussein et al. (2017) Hussein 等人(2017) Hussein, A., Gaber, M. M., Elyan, E., and Jayne, C. Imitation learning: A survey of learning methods. ACM Computing Surveys (CSUR), 50(2):21, 2017.
    Hussein, A., Gaber, M. M., Elyan, E., 和 Jayne, C. 模仿学习:学习方法综述。ACM 计算调查(CSUR),50(2):21,2017。
  • Isele & Cosgun (2018) Isele, D. and Cosgun, A. Selective experience replay for lifelong learning. arXiv preprint arXiv:1802.10269, 2018.
    Isele, D. 和 Cosgun, A. 终身学习的选择性经验重播。arXiv 预印本 arXiv:1802.10269,2018。
  • Jaksch et al. (2010) Jaksch 等人(2010) Jaksch, T., Ortner, R., and Auer, P. Near-optimal regret bounds for reinforcement learning. Journal of Machine Learning Research, 11(Apr):1563–1600, 2010.
    Jaksch, T., Ortner, R., 和 Auer, P. 强化学习的近乎最优遗憾界. 机器学习研究杂志, 11(Apr):1563–1600, 2010.
  • Jiang & Li (2016)
    Jiang, N. and Li, L. Doubly robust off-policy value evaluation for reinforcement learning. In International Conference on Machine Learning, pp. 652–661, 2016.
    姜宁和李磊。强化学习的双重稳健离线策略价值评估。在 2016 年机器学习国际会议上,第 652-661 页。
  • Johannink et al. (2018) Johannink 等人(2018) Johannink, T., Bahl, S., Nair, A., Luo, J., Kumar, A., Loskyll, M., Ojea, J. A., Solowjow, E., and Levine, S. Residual reinforcement learning for robot control. arXiv preprint arXiv:1812.03201, 2018.
    Johannink, T., Bahl, S., Nair, A., Luo, J., Kumar, A., Loskyll, M., Ojea, J. A., Solowjow, E., 和 Levine, S. 残余强化学习用于机器人控制。arXiv 预印本 arXiv:1812.03201, 2018.
  • Kakade & Langford (2002) Kakade, S. and Langford, J. Approximately optimal approximate reinforcement learning. In International Conference on Machine Learning, volume 2, pp.  267–274, 2002.
    Kakade, S. 和 Langford, J. 近似最优近似强化学习。在机器学习国际会议上,第 2 卷,2002 年,第 267-274 页。
  • Kim et al. (2013) Kim 等人(2013) Kim, B., Farahmand, A.-m., Pineau, J., and Precup, D. Learning from limited demonstrations. In Advances in Neural Information Processing Systems, pp. 2859–2867, 2013.
    Kim, B., Farahmand, A.-m., Pineau, J., 和 Precup, D. 从有限演示中学习. 在神经信息处理系统的进展中, 页码 2859–2867, 2013.
  • Kingma & Ba (2014) Kingma, D. and Ba, J. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
    Kingma, D.和 Ba, J. Adam: 一种随机优化方法。arXiv 预印本 arXiv:1412.6980,2014。
  • Kingma & Welling (2013) Kingma, D. P. and Welling, M. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013.
    Kingma, D. P. 和 Welling, M. 自编码变分贝叶斯。arXiv 预印本 arXiv:1312.6114,2013。
  • Konda & Tsitsiklis (2003)
    Konda, V. R. and Tsitsiklis, J. N. On actor-critic algorithms. SIAM journal on Control and Optimization, 42(4):1143–1166, 2003.
    Konda, V. R.和 Tsitsiklis, J. N.关于演员-评论家算法。SIAM 控制与优化杂志,42(4):1143–1166,2003。
  • Lai & Robbins (1985)
    Lai, T. L. and Robbins, H. Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics, 6(1):4–22, 1985.
    Lai, T. L.和 Robbins, H. 渐近有效的自适应分配规则。应用数学进展,6(1):4–22,1985。
  • Lange et al. (2012) Lange 等人(2012) Lange, S., Gabel, T., and Riedmiller, M. Batch reinforcement learning. In Reinforcement learning, pp.  45–73. Springer, 2012.
    Lange, S., Gabel, T., 和 Riedmiller, M. 批量强化学习. 在强化学习中, 第 45-73 页. Springer, 2012.
  • Lillicrap et al. (2015) Lillicrap 等人(2015) Lillicrap, T. P., Hunt, J. J., Pritzel, A., Heess, N., Erez, T., Tassa, Y., Silver, D., and Wierstra, D. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971, 2015.
    Lillicrap, T. P., Hunt, J. J., Pritzel, A., Heess, N., Erez, T., Tassa, Y., Silver, D., 和 Wierstra, D. 深度强化学习的连续控制。arXiv 预印本 arXiv:1509.02971,2015。
  • Lin (1992) 林(1992) Lin, L.-J. Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine learning, 8(3-4):293–321, 1992.
    Lin, L.-J.基于强化学习、规划和教学的自我改进的反应式智能体。机器学习,8(3-4):293–321,1992。
  • Liu et al. (2018) 刘等人(2018) Liu, Y., Gottesman, O., Raghu, A., Komorowski, M., Faisal, A. A., Doshi-Velez, F., and Brunskill, E. Representation balancing mdps for off-policy policy evaluation. In Advances in Neural Information Processing Systems, pp. 2644–2653, 2018.
    刘,Y.,戈特斯曼,O.,拉古,A.,科莫罗夫斯基,M.,费萨尔,A. A.,多什-韦莱兹,F.,布伦斯基尔,E. 代表平衡 MDPS 用于离线策略评估。在神经信息处理系统的进展中,2018 年,第 2644-2653 页。
  • Lu et al. (2018) 陆等人(2018) Lu, T., Schuurmans, D., and Boutilier, C. Non-delusional q-learning and value-iteration. In Advances in Neural Information Processing Systems, pp. 9971–9981, 2018.
    Lu, T., Schuurmans, D., 和 Boutilier, C. 非妄想 q-learning 和值迭代。在神经信息处理系统的进展中,第 9971-9981 页,2018 年。
  • McCloskey & Cohen (1989)
    McCloskey, M. and Cohen, N. J. Catastrophic interference in connectionist networks: The sequential learning problem. In Psychology of Learning and Motivation, volume 24, pp. 109–165. Elsevier, 1989.
    McCloskey, M. 和 Cohen, N. J. 连接主义网络中的灾难性干扰:顺序学习问题。收录于《学习与动机心理学》,第 24 卷,第 109–165 页。Elsevier,1989。
  • Melo (2001) 梅洛(2001) Melo, F. S. Convergence of q-learning: A simple proof. Institute Of Systems and Robotics, Tech. Rep, pp.  1–4, 2001.
    Melo, F. S. Q-learning 的收敛性:一个简单的证明。系统与机器人研究所,技术报告,第 1-4 页,2001 年。
  • Mnih et al. (2015) Mnih 等人(2015) Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, A., Riedmiller, M., Fidjeland, A. K., Ostrovski, G., et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.
    Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, A., Riedmiller, M., Fidjeland, A. K., Ostrovski, G., 等。 通过深度强化学习实现人类水平控制。 《自然》, 518(7540):529–533, 2015.
  • Munos et al. (2016) Munos 等人(2016) Munos, R., Stepleton, T., Harutyunyan, A., and Bellemare, M. Safe and efficient off-policy reinforcement learning. In Advances in Neural Information Processing Systems, pp. 1054–1062, 2016.
    Munos, R., Stepleton, T., Harutyunyan, A., 和 Bellemare, M. 安全高效的离策略强化学习。在神经信息处理系统进展中,第 1054-1062 页,2016 年。
  • Nair et al. (2018) Nair 等人(2018) Nair, A., McGrew, B., Andrychowicz, M., Zaremba, W., and Abbeel, P. Overcoming exploration in reinforcement learning with demonstrations. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pp.  6292–6299. IEEE, 2018.
    Nair, A., McGrew, B., Andrychowicz, M., Zaremba, W., 和 Abbeel, P. 克服强化学习中的探索问题:演示方法。在 2018 年 IEEE 国际机器人与自动化大会(ICRA)上,第 6292-6299 页。IEEE,2018。
  • O’Donoghue et al. (2018)
    O’Donoghue 等人(2018)
    O’Donoghue, B., Osband, I., Munos, R., and Mnih, V. The uncertainty Bellman equation and exploration. In International Conference on Machine Learning, volume 80, pp.  3839–3848. PMLR, 2018.
    O’Donoghue, B., Osband, I., Munos, R., 和 Mnih, V. 不确定性贝尔曼方程和探索。在机器学习国际会议上,第 80 卷,3839-3848 页。PMLR,2018。
  • Ormoneit & Sen (2002) Ormoneit, D. and Sen, Ś. Kernel-based reinforcement learning. Machine learning, 49(2-3):161–178, 2002.
    Ormoneit, D. 和 Sen, Ś. 基于核的强化学习. 机器学习, 49(2-3):161–178, 2002.
  • Osband et al. (2016) Osband 等人(2016) Osband, I., Blundell, C., Pritzel, A., and Van Roy, B. Deep exploration via bootstrapped dqn. In Advances in Neural Information Processing Systems, pp. 4026–4034, 2016.
    Osband, I., Blundell, C., Pritzel, A., 和 Van Roy, B. 深度探索通过自举 DQN。在神经信息处理系统的进展中,第 4026-4034 页,2016 年。
  • Osband et al. (2018) Osband 等人(2018) Osband, I., Aslanides, J., and Cassirer, A. Randomized prior functions for deep reinforcement learning. In Advances in Neural Information Processing Systems 31, pp. 8626–8638, 2018.
    Osband, I., Aslanides, J., 和 Cassirer, A. 随机先验函数用于深度强化学习. 在《神经信息处理系统进展》第 31 卷, pp. 8626–8638, 2018.
  • Peshkin & Shelton (2002) Peshkin, L. and Shelton, C. R. Learning from scarce experience. In International Conference on Machine Learning, pp. 498–505, 2002.
    Peshkin, L. 和 Shelton, C. R. 从稀缺经验中学习。在 2002 年机器学习国际会议上,第 498-505 页。
  • Peters & Mülling (2010)
    Peters, J. and Mülling, K. Relative entropy policy search. In AAAI, pp.  1607–1612, 2010.
    Peters, J. 和 Mülling, K. 相对熵策略搜索. 在 AAAI, 第 1607-1612 页, 2010 年。
  • Pham et al. (2018) Pham 等人(2018) Pham, T.-H., De Magistris, G., Agravante, D. J., Chaudhury, S., Munawar, A., and Tachibana, R. Constrained exploration and recovery from experience shaping. arXiv preprint arXiv:1809.08925, 2018.
    Pham, T.-H., De Magistris, G., Agravante, D. J., Chaudhury, S., Munawar, A., 和 Tachibana, R. 受限制的探索和经验塑造恢复。arXiv 预印本 arXiv:1809.08925, 2018.
  • Piot et al. (2014) Piot 等人(2014) Piot, B., Geist, M., and Pietquin, O. Boosted bellman residual minimization handling expert demonstrations. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pp.  549–564. Springer, 2014.
    Piot, B., Geist, M., 和 Pietquin, O. 增强贝尔曼残差最小化处理专家演示。在欧洲机器学习和数据库知识发现联合会议上,549-564 页。Springer,2014。
  • Precup et al. (2001) Precup 等人(2001) Precup, D., Sutton, R. S., and Dasgupta, S. Off-policy temporal-difference learning with function approximation. In International Conference on Machine Learning, pp. 417–424, 2001.
    Precup, D., Sutton, R. S., 和 Dasgupta, S. 利用函数逼近的离策略时序差分学习。在 2001 年机器学习国际会议上,第 417-424 页。
  • Rezende et al. (2014) Rezende 等人(2014) Rezende, D. J., Mohamed, S., and Wierstra, D. Stochastic backpropagation and approximate inference in deep generative models. arXiv preprint arXiv:1401.4082, 2014.
    Rezende, D. J., Mohamed, S., 和 Wierstra, D. 深度生成模型中的随机反向传播和近似推断. arXiv 预印本 arXiv:1401.4082, 2014.
  • Riedmiller (2005) Riedmiller, M. Neural fitted q iteration–first experiences with a data efficient neural reinforcement learning method. In European Conference on Machine Learning, pp.  317–328. Springer, 2005.
    Riedmiller, M. 神经拟合 Q 迭代——一种数据高效的神经强化学习方法的初步体验。在欧洲机器学习会议上,第 317-328 页。Springer,2005。
  • Schaal (1999) 舍尔(1999) Schaal, S. Is imitation learning the route to humanoid robots? Trends in Cognitive Sciences, 3(6):233–242, 1999.
    Schaal, S. 模仿学习是通向类人机器人的途径吗?《认知科学趋势》,3(6):233–242,1999。
  • Schulman et al. (2015) Schulman 等人(2015) Schulman, J., Levine, S., Abbeel, P., Jordan, M., and Moritz, P. Trust region policy optimization. In International Conference on Machine Learning, pp. 1889–1897, 2015.
    Schulman, J., Levine, S., Abbeel, P., Jordan, M., 和 Moritz, P. 信任区域策略优化. 在机器学习国际会议上,页码 1889–1897,2015 年。
  • Silver et al. (2014) Silver 等人(2014) Silver, D., Lever, G., Heess, N., Degris, T., Wierstra, D., and Riedmiller, M. Deterministic policy gradient algorithms. In International Conference on Machine Learning, pp. 387–395, 2014.
    Silver, D., Lever, G., Heess, N., Degris, T., Wierstra, D., 和 Riedmiller, M. 确定性策略梯度算法. 在机器学习国际会议上, 第 387-395 页, 2014.
  • Silver et al. (2018) Silver 等人(2018) Silver, T., Allen, K., Tenenbaum, J., and Kaelbling, L. Residual policy learning. arXiv preprint arXiv:1812.06298, 2018.
    Silver, T., Allen, K., Tenenbaum, J., 和 Kaelbling, L. 残差策略学习. arXiv 预印本 arXiv:1812.06298, 2018.
  • Singh et al. (2000) 辛格等人(2000) Singh, S., Jaakkola, T., Littman, M. L., and Szepesvári, C. Convergence results for single-step on-policy reinforcement-learning algorithms. Machine learning, 38(3):287–308, 2000.
    Singh, S., Jaakkola, T., Littman, M. L., 和 Szepesvári, C. 单步在线策略强化学习算法的收敛结果。机器学习,38(3):287–308,2000。
  • Sohn et al. (2015) Sohn 等人(2015) Sohn, K., Lee, H., and Yan, X. Learning structured output representation using deep conditional generative models. In Advances in Neural Information Processing Systems, pp. 3483–3491, 2015.
    Sohn, K., Lee, H., 和 Yan, X. 使用深度条件生成模型学习结构化输出表示。在神经信息处理系统的进展中,第 3483-3491 页,2015 年。
  • Strehl & Littman (2008) Strehl, A. L. and Littman, M. L. An analysis of model-based interval estimation for markov decision processes. Journal of Computer and System Sciences, 74(8):1309–1331, 2008.
    Strehl, A. L.和 Littman, M. L.对马尔可夫决策过程的基于模型的区间估计进行了分析。《计算机与系统科学杂志》,74(8):1309–1331,2008。
  • Sun et al. (2017) 孙等人(2017) Sun, W., Venkatraman, A., Gordon, G. J., Boots, B., and Bagnell, J. A. Deeply aggrevated: Differentiable imitation learning for sequential prediction. In International Conference on Machine Learning, pp. 3309–3318, 2017.
    Sun, W., Venkatraman, A., Gordon, G. J., Boots, B., 和 Bagnell, J. A. 深度激怒:用于序列预测的可微模仿学习。在机器学习国际会议上,第 3309-3318 页,2017 年。
  • Sun et al. (2018) 孙等人(2018) Sun, W., Bagnell, J. A., and Boots, B. Truncated horizon policy search: Combining reinforcement learning & imitation learning. arXiv preprint arXiv:1805.11240, 2018.
    Sun, W., Bagnell, J. A., 和 Boots, B. 截断视野策略搜索:结合强化学习和模仿学习。arXiv 预印本 arXiv:1805.11240,2018。
  • Sutton (1988) 萨顿(1988) Sutton, R. S. Learning to predict by the methods of temporal differences. Machine learning, 3(1):9–44, 1988.
    Sutton, R. S. 时间差异方法预测学习。机器学习,3(1):9–44,1988。
  • Sutton & Barto (1998)
    Sutton, R. S. and Barto, A. G. Reinforcement learning: An introduction, volume 1. MIT press Cambridge, 1998.
    萨顿,R.S.和巴托,A.G.强化学习:导论,第 1 卷。麻省理工学院出版社,1998 年。
  • Thomas et al. (2015) Thomas 等人(2015) Thomas, P., Theocharous, G., and Ghavamzadeh, M. High confidence policy improvement. In International Conference on Machine Learning, pp. 2380–2388, 2015.
    Thomas, P., Theocharous, G., 和 Ghavamzadeh, M. 高置信度策略改进. 在机器学习国际会议上,第 2380-2388 页,2015 年。
  • Thrun & Schwartz (1993)
    Thrun & Schwartz(1993)
    Thrun, S. and Schwartz, A. Issues in using function approximation for reinforcement learning. In Proceedings of the 1993 Connectionist Models Summer School Hillsdale, NJ. Lawrence Erlbaum, 1993.
    Thrun, S.和 Schwartz, A.在强化学习中使用函数逼近的问题。在 1993 年连接主义模型暑期学校会议论文集中。劳伦斯埃尔巴姆,1993。
  • Todorov et al. (2012) Todorov 等人(2012) Todorov, E., Erez, T., and Tassa, Y. Mujoco: A physics engine for model-based control. In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp.  5026–5033. IEEE, 2012.
    Todorov, E., Erez, T., 和 Tassa, Y. Mujoco: 一个基于模型控制的物理引擎。在 IEEE/RSJ 国际智能机器人与系统大会(IROS)上,5026-5033 页。IEEE,2012。
  • Touati et al. (2018) Touati 等人(2018) Touati, A., Satija, H., Romoff, J., Pineau, J., and Vincent, P. Randomized value functions via multiplicative normalizing flows. arXiv preprint arXiv:1806.02315, 2018.
    Touati, A., Satija, H., Romoff, J., Pineau, J., 和 Vincent, P. 通过乘法归一化流的随机值函数。arXiv 预印本编号 arXiv:1806.02315,2018。
  • Van Hasselt (2010) 范哈塞尔特(2010) Van Hasselt, H. Double q-learning. In Advances in Neural Information Processing Systems, pp. 2613–2621, 2010.
    Van Hasselt, H. 双 Q 学习. 在神经信息处理系统的进展中,第 2613-2621 页,2010 年。
  • Van Hasselt et al. (2016)
    Van Hasselt 等人(2016)
    Van Hasselt, H., Guez, A., and Silver, D. Deep reinforcement learning with double q-learning. In AAAI, pp.  2094–2100, 2016.
    Van Hasselt, H., Guez, A., 和 Silver, D. 深度强化学习与双 Q 学习。在 AAAI,第 2094-2100 页,2016 年。
  • Van Hoof et al. (2017) Van Hoof 等人(2017) Van Hoof, H., Neumann, G., and Peters, J. Non-parametric policy search with limited information loss. The Journal of Machine Learning Research, 18(1):2472–2517, 2017.
    Van Hoof, H., Neumann, G., 和 Peters, J. 有限信息损失的非参数策略搜索。机器学习研究杂志, 18(1):2472–2517, 2017.
  • Večerík et al. (2017) Večerík 等人(2017) Večerík, M., Hester, T., Scholz, J., Wang, F., Pietquin, O., Piot, B., Heess, N., Rothörl, T., Lampe, T., and Riedmiller, M. Leveraging demonstrations for deep reinforcement learning on robotics problems with sparse rewards. arXiv preprint arXiv:1707.08817, 2017.
    Večerík, M., Hester, T., Scholz, J., Wang, F., Pietquin, O., Piot, B., Heess, N., Rothörl, T., Lampe, T., 和 Riedmiller, M. 利用演示在稀疏奖励的机器人问题上进行深度强化学习。arXiv 预印本 arXiv:1707.08817, 2017.
  • Watkins (1989) Watkins(1989) Watkins, C. J. C. H. Learning from delayed rewards. PhD thesis, King’s College, Cambridge, 1989.
    沃特金斯,C. J. C. C. 从延迟奖励中学习。剑桥大学国王学院博士论文,1989 年。
  • Xu et al. (2018) 徐等人(2018) Xu, H., Li, Y., Tian, Y., Darrell, T., and Ma, T. Algorithmic framework for model-based reinforcement learning with theoretical guarantees. arXiv preprint arXiv:1807.03858, 2018.
    Xu, H., Li, Y., Tian, Y., Darrell, T., 和 Ma, T. 带有理论保证的基于模型的强化学习算法框架. arXiv 预印本 arXiv:1807.03858, 2018.
  • Zhang & Sutton (2017)
    Zhang, S. and Sutton, R. S. A deeper look at experience replay. arXiv preprint arXiv:1712.01275, 2017.
    张,S.和萨顿,R.S.深入研究经验回放。arXiv 预印本 arXiv:1712.01275,2017。

Appendix A Missing Proofs 附录 A 缺失的证明

A.1 Proofs and Details from Section 4.1
4.1 节的证明和细节

Definition 1. We define a coherent batch \mathcal{B} as a batch such that if (s,a,s)𝑠𝑎superscript𝑠(s,a,s^{\prime})\in\mathcal{B} then ssuperscript𝑠s^{\prime}\in\mathcal{B} unless ssuperscript𝑠s^{\prime} is a terminal state.
定义 1. 我们将一个连贯的批次 \mathcal{B}

Definition 2. We define ϵMDP(s,a)=Qπ(s,a)Qπ(s,a)subscriptitalic-ϵMDP𝑠𝑎superscript𝑄𝜋𝑠𝑎subscriptsuperscript𝑄𝜋𝑠𝑎\epsilon_{\text{MDP}}(s,a)=Q^{\pi}(s,a)-Q^{\pi}_{\mathcal{B}}(s,a) as the error between the true value of a policy π𝜋\pi in the MDP M𝑀M and the value of π𝜋\pi when learned with a batch \mathcal{B}.
定义 2. 我们将 ϵMDP(s,a)=Qπ(s,a)Qπ(s,a)subscriptitalic-ϵMDP𝑠𝑎superscript𝑄𝜋𝑠𝑎subscriptsuperscript𝑄𝜋𝑠𝑎\epsilon_{\text{MDP}}(s,a)=Q^{\pi}(s,a)-Q^{\pi}_{\mathcal{B}}(s,a) 定义为在 MDP M𝑀M 中策略 π𝜋\pi 的真实值与使用批处理 \mathcal{B} 学习时的值 π𝜋\pi 之间的误差。

Definition 3. For simplicity in notation, we denote
定义 3. 为了简化表示法,我们表示

ϵMDPπ=sμπ(s)aπ(a|s)|ϵMDP(s,a)|.subscriptsuperscriptitalic-ϵ𝜋MDPsubscript𝑠subscript𝜇𝜋𝑠subscript𝑎𝜋conditional𝑎𝑠subscriptitalic-ϵMDP𝑠𝑎\epsilon^{\pi}_{\text{MDP}}=\sum_{s}\mu_{\pi}(s)\sum_{a}\pi(a|s)|\epsilon_{\text{MDP}}(s,a)|. (14)

To evaluate a policy π𝜋\pi exactly at relevant state-action pairs, only ϵMDPπ=0subscriptsuperscriptitalic-ϵ𝜋MDP0\epsilon^{\pi}_{\text{MDP}}=0 is required.
要在相关的状态-动作对上精确评估一项政策,只需要 ϵMDPπ=0subscriptsuperscriptitalic-ϵ𝜋MDP0\epsilon^{\pi}_{\text{MDP}}=0

Definition 4. We define the optimal batch-constrained policy πΠsuperscript𝜋subscriptΠ\pi^{*}\in\Pi_{\mathcal{B}} such that Qπ(s,a)Qπ(s,a)superscript𝑄superscript𝜋𝑠𝑎superscript𝑄𝜋𝑠𝑎Q^{\pi^{*}}(s,a)\geq Q^{\pi}(s,a) for all πΠ𝜋subscriptΠ\pi\in\Pi_{\mathcal{B}} and (s,a)𝑠𝑎(s,a)\in\mathcal{B}.
定义 4. 我们定义最优的批量约束策略 πΠsuperscript𝜋subscriptΠ\pi^{*}\in\Pi_{\mathcal{B}} ,使得对于所有 πΠ𝜋subscriptΠ\pi\in\Pi_{\mathcal{B}}(s,a)𝑠𝑎(s,a)\in\mathcal{B}

Algorithm 1. Batch-Constrained Q-learning (BCQL) maintains a tabular value function Q(s,a)𝑄𝑠𝑎Q(s,a) for each possible state-action pair (s,a)𝑠𝑎(s,a). A transition tuple (s,a,r,s)𝑠𝑎𝑟superscript𝑠(s,a,r,s^{\prime}) is sampled from the batch \mathcal{B} with uniform probability and the following update rule is applied, with learning rate α𝛼\alpha:
算法 1. 批量约束 Q 学习(BCQL)维护每个可能的状态-动作对 (s,a)𝑠𝑎(s,a) 的表格值函数 Q(s,a)𝑄𝑠𝑎Q(s,a) 。从批量 \mathcal{B} 中均匀采样一个转移元组 (s,a,r,s)𝑠𝑎𝑟superscript𝑠(s,a,r,s^{\prime}) ,并应用以下更新规则,学习率为 α𝛼\alpha