
Off-Policy Deep Reinforcement Learning without Exploration

Scott Fujimoto    David Meger    Doina Precup

Abstract

Many practical applications of reinforcement learning constrain agents to learn from a fixed batch of data which has already been gathered, without offering further possibility for data collection. In this paper, we demonstrate that due to errors introduced by extrapolation, standard off-policy deep reinforcement learning algorithms, such as DQN and DDPG, are incapable of learning without data correlated to the distribution under the current policy, making them ineffective for this fixed batch setting. We introduce a novel class of off-policy algorithms, batch-constrained reinforcement learning, which restricts the action space in order to force the agent towards behaving close to on-policy with respect to a subset of the given data. We present the first continuous control deep reinforcement learning algorithm which can learn effectively from arbitrary, fixed batch data, and empirically demonstrate the quality of its behavior in several tasks.

Machine Learning, ICML, Deep Reinforcement Learning, Imitation Learning, Batch Reinforcement Learning, Off-Policy

1 Introduction

Batch reinforcement learning, the task of learning from a fixed dataset without further interactions with the environment, is a crucial requirement for scaling reinforcement learning to tasks where the data collection procedure is costly, risky, or time-consuming. Off-policy batch reinforcement learning has important implications for many practical applications. It is often preferable for data collection to be performed by some secondary controlled process, such as a human operator or a carefully monitored program. If assumptions on the quality of the behavioral policy can be made, imitation learning can be used to produce strong policies. However, most imitation learning algorithms are known to fail when exposed to suboptimal trajectories, or require further interactions with the environment to compensate (Hester et al., 2017; Sun et al., 2018; Cheng et al., 2018). On the other hand, batch reinforcement learning offers a mechanism for learning from a fixed dataset without restrictions on the quality of the data.

Most modern off-policy deep reinforcement learning algorithms fall into the category of growing batch learning (Lange et al., 2012), in which data is collected and stored into an experience replay dataset (Lin, 1992), which is used to train the agent before further data collection occurs. However, we find that these “off-policy” algorithms can fail in the batch setting, becoming unsuccessful if the dataset is uncorrelated to the true distribution under the current policy. Our most surprising result shows that off-policy agents perform dramatically worse than the behavioral agent when trained with the same algorithm on the same dataset.

This inability to learn truly off-policy is due to a fundamental problem with off-policy reinforcement learning we denote extrapolation error, a phenomenon in which unseen state-action pairs are erroneously estimated to have unrealistic values. Extrapolation error can be attributed to a mismatch in the distribution of data induced by the policy and the distribution of data contained in the batch. As a result, it may be impossible to learn a value function for a policy which selects actions not contained in the batch.

To overcome extrapolation error in off-policy learning, we introduce batch-constrained reinforcement learning, where agents are trained to maximize reward while minimizing the mismatch between the state-action visitation of the policy and the state-action pairs contained in the batch. Our deep reinforcement learning algorithm, Batch-Constrained deep Q-learning (BCQ), uses a state-conditioned generative model to produce only previously seen actions. This generative model is combined with a Q-network, to select the highest valued action which is similar to the data in the batch. Under mild assumptions, we prove this batch-constrained paradigm is necessary for unbiased value estimation from incomplete datasets for finite deterministic MDPs.

Unlike any previous continuous control deep reinforcement learning algorithm, BCQ is able to learn successfully without interacting with the environment by considering extrapolation error. Our algorithm is evaluated on a series of batch reinforcement learning tasks in MuJoCo environments (Todorov et al., 2012; Brockman et al., 2016), where extrapolation error is particularly problematic due to the high-dimensional continuous action space, which is impossible to sample exhaustively. Our algorithm offers a unified view on imitation and off-policy learning, and is capable of learning from purely expert demonstrations, as well as from finite batches of suboptimal data, without further exploration. We remark that BCQ is only one way to approach batch-constrained reinforcement learning in a deep setting, and we hope that it will serve as a foundation for future algorithms. To ensure reproducibility, we provide precise experimental and implementation details, and our code is made available (https://github.com/sfujim/BCQ).

2 Background

In reinforcement learning, an agent interacts with its environment, typically assumed to be a Markov decision process (MDP) $(\mathcal{S},\mathcal{A},p_M,r,\gamma)$, with state space $\mathcal{S}$, action space $\mathcal{A}$, and transition dynamics $p_M(s'|s,a)$. At each discrete time step, the agent receives a reward $r(s,a,s')\in\mathbb{R}$ for performing action $a$ in state $s$ and arriving at the state $s'$. The goal of the agent is to maximize the expectation of the sum of discounted rewards, known as the return $R_t=\sum_{i=t+1}^{\infty}\gamma^{i}r(s_i,a_i,s_{i+1})$, which weighs future rewards with respect to the discount factor $\gamma\in[0,1)$.

The agent selects actions with respect to a policy $\pi:\mathcal{S}\rightarrow\mathcal{A}$, which exhibits a distribution $\mu^\pi(s)$ over the states $s\in\mathcal{S}$ visited by the policy. Each policy $\pi$ has a corresponding value function $Q^\pi(s,a)=\mathbb{E}_\pi[R_t|s,a]$, the expected return when following the policy after taking action $a$ in state $s$. For a given policy $\pi$, the value function can be computed through the Bellman operator $\mathcal{T}^\pi$:

$\mathcal{T}^\pi Q(s,a)=\mathbb{E}_{s'}[r+\gamma Q(s',\pi(s'))].$ (1)

The Bellman operator is a contraction for $\gamma\in[0,1)$ with unique fixed point $Q^\pi(s,a)$ (Bertsekas & Tsitsiklis, 1996). $Q^*(s,a)=\max_\pi Q^\pi(s,a)$ is known as the optimal value function, which has a corresponding optimal policy obtained through greedy action choices. For large or continuous state and action spaces, the value can be approximated with neural networks, e.g. using the Deep Q-Network algorithm (DQN) (Mnih et al., 2015). In DQN, the value function $Q_\theta$ is updated using the target:

$r+\gamma Q_{\theta'}(s',\pi(s')),\quad\pi(s')=\operatorname{argmax}_a Q_{\theta'}(s',a),$ (2)

Q-learning is an off-policy algorithm (Sutton & Barto, 1998), meaning the target can be computed without consideration of how the experience was generated. In principle, off-policy reinforcement learning algorithms are able to learn from data collected by any behavioral policy. Typically, the loss is minimized over mini-batches of tuples of the agent's past data, $(s,a,r,s')\in\mathcal{B}$, sampled from an experience replay dataset $\mathcal{B}$ (Lin, 1992). For shorthand, we often write $s\in\mathcal{B}$ if there exists a transition tuple containing $s$ in the batch $\mathcal{B}$, and similarly for $(s,a)$ or $(s,a,s')\in\mathcal{B}$. In batch reinforcement learning, we assume $\mathcal{B}$ is fixed and no further interaction with the environment occurs. To further stabilize learning, a target network with frozen parameters $Q_{\theta'}$ is used in the learning target. The parameters of the target network $\theta'$ are updated to the current network parameters $\theta$ after a fixed number of time steps, or by averaging $\theta'\leftarrow\tau\theta+(1-\tau)\theta'$ for some small $\tau$ (Lillicrap et al., 2015).
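As an illustration of this update, a minimal PyTorch-style sketch follows (not taken from any particular implementation; `q_net`, `q_target`, and the sampled tensors are assumed placeholders for a discrete-action Q-network and a mini-batch drawn from $\mathcal{B}$):

```python
import torch
import torch.nn.functional as F

def dqn_update(q_net, q_target, optimizer, s, a, r, s2, done, gamma=0.99, tau=0.005):
    # Target of Equation (2): greedy action under the frozen target network.
    with torch.no_grad():
        target_q = r + gamma * (1 - done) * q_target(s2).max(dim=1, keepdim=True).values
    # Current estimate Q_theta(s, a) for the sampled actions (a has shape [N, 1]).
    current_q = q_net(s).gather(1, a.long())
    loss = F.mse_loss(current_q, target_q)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    # Soft target update: theta' <- tau * theta + (1 - tau) * theta'.
    for p, p_targ in zip(q_net.parameters(), q_target.parameters()):
        p_targ.data.mul_(1 - tau).add_(tau * p.data)
```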

In a continuous action space, the analytic maximum of Equation (2) is intractable. In this case, actor-critic methods are commonly used, where action selection is performed through a separate policy network πϕsubscript𝜋italic-ϕ\pi_{\phi}, known as the actor, and updated with respect to a value estimate, known as the critic (Sutton & Barto, 1998; Konda & Tsitsiklis, 2003). This policy can be updated following the deterministic policy gradient theorem (Silver et al., 2014):

$\phi\leftarrow\operatorname{argmax}_\phi\mathbb{E}_{s\in\mathcal{B}}[Q_\theta(s,\pi_\phi(s))],$ (3)

which corresponds to learning an approximation to the maximum of $Q_\theta$, by propagating the gradient through both $\pi_\phi$ and $Q_\theta$. When combined with off-policy deep Q-learning to learn $Q_\theta$, this algorithm is referred to as Deep Deterministic Policy Gradients (DDPG) (Lillicrap et al., 2015).
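For completeness, a correspondingly minimal sketch of the actor update in Equation (3), assuming a continuous-action critic `q_net(s, a)` and actor `pi_net(s)` (hypothetical names, not the released code):

```python
def ddpg_actor_update(pi_net, q_net, actor_optimizer, s):
    # Deterministic policy gradient: ascend Q_theta(s, pi_phi(s)) by
    # backpropagating through both the critic and the actor.
    actor_loss = -q_net(s, pi_net(s)).mean()
    actor_optimizer.zero_grad()
    actor_loss.backward()
    actor_optimizer.step()
```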

3 Extrapolation Error

Extrapolation error is an error in off-policy value learning which is introduced by the mismatch between the dataset and the true state-action visitation of the current policy. The value estimate $Q(s,a)$ is affected by extrapolation error during a value update where the target policy selects an unfamiliar action $a'$ at the next state $s'$ in the backed-up value estimate, such that $(s',a')$ is unlikely, or not contained, in the dataset. The cause of extrapolation error can be attributed to several related factors:

Absent Data. If any state-action pair $(s,a)$ is unavailable, then error is introduced as some function of the amount of similar data and the approximation error. This means that the estimate of $Q_\theta(s',\pi(s'))$ may be arbitrarily bad without sufficient data near $(s',\pi(s'))$.

Model Bias. When performing off-policy Q-learning with a batch $\mathcal{B}$, the Bellman operator $\mathcal{T}^\pi$ is approximated by sampling transition tuples $(s,a,r,s')$ from $\mathcal{B}$ to estimate the expectation over $s'$. However, for a stochastic MDP, without infinite state-action visitation, this produces a biased estimate of the transition dynamics:

$\mathcal{T}^\pi Q(s,a)\approx\mathbb{E}_{s'\sim\mathcal{B}}[r+\gamma Q(s',\pi(s'))],$ (4)

where the expectation is with respect to transitions in the batch $\mathcal{B}$, rather than the true MDP.

Training Mismatch. Even with sufficient data, in deep Q-learning systems, transitions are sampled uniformly from the dataset, giving a loss weighted with respect to the likelihood of data in the batch:

$\approx\frac{1}{|\mathcal{B}|}\sum_{(s,a,r,s')\in\mathcal{B}}\|r+\gamma Q_{\theta'}(s',\pi(s'))-Q_\theta(s,a)\|^2.$ (5)

If the distribution of data in the batch does not correspond with the distribution under the current policy, the value function may be a poor estimate of actions selected by the current policy, due to the mismatch in training.

We remark that re-weighting the loss in Equation (5) with respect to the likelihood under the current policy can still result in poor estimates if state-action pairs with high likelihood under the current policy are not found in the batch. This means only a subset of possible policies can be evaluated accurately. As a result, learning a value estimate with off-policy data can result in large amounts of extrapolation error if the policy selects actions which are not similar to the data found in the batch. In the following section, we discuss how state of the art off-policy deep reinforcement learning algorithms fail to address the concern of extrapolation error, and demonstrate the implications in practical examples.

3.1 Extrapolation Error in Deep Reinforcement Learning

Deep Q-learning algorithms (Mnih et al., 2015) have been labeled as off-policy due to their connection to off-policy Q-learning (Watkins, 1989). However, these algorithms tend to use near-on-policy exploratory policies, such as $\epsilon$-greedy, in conjunction with a replay buffer (Lin, 1992). As a result, the generated dataset tends to be heavily correlated to the current policy. In this section, we examine how these off-policy algorithms perform when learning with uncorrelated datasets. Our results demonstrate that the performance of a state of the art deep actor-critic algorithm, DDPG (Lillicrap et al., 2015), deteriorates rapidly when the data is uncorrelated and the value estimate produced by the deep Q-network diverges. These results suggest that off-policy deep reinforcement learning algorithms are ineffective when learning truly off-policy.

Our practical experiments examine three different batch settings in OpenAI gym’s Hopper-v1 environment (Todorov et al., 2012; Brockman et al., 2016), which we use to train an off-policy DDPG agent with no interaction with the environment. Experiments with additional environments and specific details can be found in the Supplementary Material.

[Figure 1; panels: (a) final buffer performance, (b) concurrent performance, (c) imitation performance, (d) final buffer value estimate, (e) concurrent value estimate, (f) imitation value estimate.]
Figure 1: We examine the performance (top row) and corresponding value estimates (bottom row) of DDPG in three batch tasks on Hopper-v1. Each individual trial is plotted with a thin line, with the mean in bold (evaluated without exploration noise). Straight lines represent the average return of episodes contained in the batch (with exploration noise). An estimate of the true value of the off-policy agent, evaluated by Monte Carlo returns, is marked by a dotted line. In all three experiments, we observe a large gap in the performance between the behavioral and off-policy agent, even when learning from the same dataset (concurrent). Furthermore, the value estimates are unstable or divergent across all tasks.

Batch 1 (Final buffer). We train a DDPG agent for 1 million time steps, adding $\mathcal{N}(0,0.5)$ Gaussian noise to actions for high exploration, and store all experienced transitions. This collection procedure creates a dataset with a diverse set of states and actions, with the aim of sufficient coverage.

Batch 2 (Concurrent). We concurrently train the off-policy and behavioral DDPG agents for 1 million time steps. To ensure sufficient exploration, a standard $\mathcal{N}(0,0.1)$ Gaussian noise is added to actions taken by the behavioral policy. Each transition experienced by the behavioral policy is stored in a replay buffer, which both agents learn from. As a result, both agents are trained with the identical dataset.

Batch 3 (Imitation). A trained DDPG agent acts as an expert, and is used to collect a dataset of 1 million transitions.
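For concreteness, a rough sketch of the final buffer collection procedure (illustrative only; `env` follows the classic gym API and `ddpg_agent.select_action` is an assumed placeholder):

```python
import numpy as np

def collect_final_buffer(env, ddpg_agent, steps=1_000_000, noise_std=0.5):
    """Roll out a DDPG agent with high exploration noise and store every transition."""
    buffer = []
    s = env.reset()
    for _ in range(steps):
        a = ddpg_agent.select_action(s)
        a = a + np.random.normal(0, noise_std, size=a.shape)
        a = np.clip(a, env.action_space.low, env.action_space.high)
        s2, r, done, _ = env.step(a)
        buffer.append((s, a, r, s2, float(done)))
        s = env.reset() if done else s2
    return buffer
```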

In Figure 1, we graph the performance of the agents as they train with each batch, as well as their value estimates. Straight lines represent the average return of episodes contained in the batch. Additionally, we graph the learning performance of the behavioral agent for the relevant tasks.

Our experiments demonstrate several surprising facts about off-policy deep reinforcement learning agents. In each task, the off-policy agent performs significantly worse than the behavioral agent. Even in the concurrent experiment, where both agents are trained with the same dataset, there is a large gap in performance in every single trial. This result suggests that differences in the state distribution under the initial policies are enough for extrapolation error to drastically offset the performance of the off-policy agent. Additionally, the corresponding value estimate exhibits divergent behavior, while the value estimate of the behavioral agent is highly stable. In the final buffer experiment, the off-policy agent is provided with a large and diverse dataset, with the aim of providing sufficient coverage of the initial policy. Even in this instance, the value estimate is highly unstable, and the performance suffers. In the imitation setting, the agent is provided with expert data. However, the agent quickly learns to take non-expert actions, under the guise of optimistic extrapolation. As a result, the value estimates rapidly diverge and the agent fails to learn.

Although extrapolation error is not necessarily positively biased, when combined with maximization in reinforcement learning algorithms, extrapolation error provides a source of noise that can induce a persistent overestimation bias (Thrun & Schwartz, 1993; Van Hasselt et al., 2016; Fujimoto et al., 2018). In an on-policy setting, extrapolation error may be a source of beneficial exploration through an implicit “optimism in the face of uncertainty” strategy (Lai & Robbins, 1985; Jaksch et al., 2010). In this case, if the value function overestimates an unknown state-action pair, the policy will collect data in the region of uncertainty, and the value estimate will be corrected. However, when learning off-policy, or in a batch setting, extrapolation error will never be corrected due to the inability to collect new data.

These experiments show extrapolation error can be highly detrimental to learning off-policy in a batch reinforcement learning setting. While the continuous state space and multi-dimensional action space in MuJoCo environments are contributing factors to extrapolation error, the scale of these tasks is small compared to real world settings. As a result, even with a sufficient amount of data collection, extrapolation error may still occur due to the concern of catastrophic forgetting (McCloskey & Cohen, 1989; Goodfellow et al., 2013). Consequently, off-policy reinforcement learning algorithms used in the real-world will require practical guarantees without exhaustive amounts of data.

4 Batch-Constrained Reinforcement Learning

Current off-policy deep reinforcement learning algorithms fail to address extrapolation error by selecting actions with respect to a learned value estimate, without consideration of the accuracy of the estimate. As a result, certain out-of-distribution actions can be erroneously extrapolated to higher values. However, the value of an off-policy agent can be accurately evaluated in regions where data is available. We propose a conceptually simple idea: to avoid extrapolation error a policy should induce a similar state-action visitation to the batch. We denote policies which satisfy this notion as batch-constrained. To optimize off-policy learning for a given batch, batch-constrained policies are trained to select actions with respect to three objectives:

  1. Minimize the distance of selected actions to the data in the batch.
  2. Lead to states where familiar data can be observed.
  3. Maximize the value function.

We note the importance of objective (1) above the others, as the value function and estimates of future states may be arbitrarily poor without access to the corresponding transitions. That is, we cannot correctly estimate (2) and (3) unless (1) is sufficiently satisfied. As a result, we propose optimizing the value function, along with some measure of future certainty, with a constraint limiting the distance of selected actions to the batch. This is achieved in our deep reinforcement learning algorithm through a state-conditioned generative model, to produce likely actions under the batch. This generative model is combined with a network which aims to optimally perturb the generated actions in a small range, along with a Q-network, used to select the highest valued action. Finally, we train a pair of Q-networks, and take the minimum of their estimates during the value update. This update penalizes states which are unfamiliar, and pushes the policy to select actions which lead to certain data.

We begin by analyzing the theoretical properties of batch-constrained policies in a finite MDP setting, where we are able to quantify extrapolation error precisely. We then introduce our deep reinforcement learning algorithm, Batch-Constrained deep Q-learning (BCQ), in detail, drawing inspiration from the tabular analogue.

4.1 Addressing Extrapolation Error in Finite MDPs

In the finite MDP setting, extrapolation error can be described by the bias from the mismatch between the transitions contained in the buffer and the true MDP. We find that by inducing a data distribution that is contained entirely within the batch, batch-constrained policies can eliminate extrapolation error entirely for deterministic MDPs. In addition, we show that the batch-constrained variant of Q-learning converges to the optimal policy under the same conditions as the standard form of Q-learning. Moreover, we prove that for a deterministic MDP, batch-constrained Q-learning is guaranteed to match, or outperform, the behavioral policy when starting from any state contained in the batch. All of the proofs for this section can be found in the Supplementary Material.

A value estimate $Q$ can be learned using an experience replay buffer $\mathcal{B}$. This involves sampling transition tuples $(s,a,r,s')$ with uniform probability, and applying the temporal difference update (Sutton, 1988; Watkins, 1989):

$Q(s,a)\leftarrow(1-\alpha)Q(s,a)+\alpha\left(r+\gamma Q(s',\pi(s'))\right).$ (6)

If $\pi(s')=\operatorname{argmax}_{a'}Q(s',a')$, this is known as Q-learning. Assuming a non-zero probability of sampling any possible transition tuple from the buffer and infinite updates, Q-learning converges to the optimal value function.

We begin by showing that the value function $Q$ learned with the batch $\mathcal{B}$ corresponds to the value function for an alternate MDP $M_\mathcal{B}$. From the true MDP $M$ and initial values $Q(s,a)$, we define the new MDP $M_\mathcal{B}$ with the same action and state space as $M$, along with an additional terminal state $s_{\text{init}}$. $M_\mathcal{B}$ has transition probabilities $p_\mathcal{B}(s'|s,a)=\frac{N(s,a,s')}{\sum_{\tilde{s}}N(s,a,\tilde{s})}$, where $N(s,a,s')$ is the number of times the tuple $(s,a,s')$ is observed in $\mathcal{B}$. If $\sum_{\tilde{s}}N(s,a,\tilde{s})=0$, then $p_\mathcal{B}(s_{\text{init}}|s,a)=1$, where $r(s,a,s_{\text{init}})$ is set to the initialized value of $Q(s,a)$.
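As an illustrative sketch (not part of the paper), the transition probabilities of $M_\mathcal{B}$ can be tabulated directly from counts over the batch; here `batch` is assumed to be a list of `(s, a, r, s2)` tuples over hashable states and actions:

```python
from collections import defaultdict

def empirical_transition_model(batch):
    """Count-based estimate p_B(s'|s,a) = N(s,a,s') / sum_s~ N(s,a,s~)."""
    counts = defaultdict(lambda: defaultdict(int))
    for (s, a, r, s2) in batch:
        counts[(s, a)][s2] += 1

    def p_B(s2, s, a):
        total = sum(counts[(s, a)].values())
        if total == 0:
            # Unseen (s, a): in M_B this transitions to the terminal state s_init
            # with probability 1, with reward set to the initial Q(s, a).
            return None
        return counts[(s, a)][s2] / total

    return p_B
```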

Theorem 1. Performing Q-learning by sampling from a batch $\mathcal{B}$ converges to the optimal value function under the MDP $M_\mathcal{B}$.

We define $\epsilon_{\text{MDP}}$ as the tabular extrapolation error, which accounts for the discrepancy between the value function $Q^\pi_\mathcal{B}$ computed with the batch $\mathcal{B}$ and the value function $Q^\pi$ computed with the true MDP $M$:

$\epsilon_{\text{MDP}}(s,a)=Q^\pi(s,a)-Q^\pi_\mathcal{B}(s,a).$ (7)

For any policy $\pi$, the exact form of $\epsilon_{\text{MDP}}(s,a)$ can be computed through a Bellman-like equation:

$\begin{split}\epsilon_{\text{MDP}}(s,a)=\sum_{s'}&\left(p_M(s'|s,a)-p_\mathcal{B}(s'|s,a)\right)\left(r(s,a,s')+\gamma\sum_{a'}\pi(a'|s')Q^\pi_\mathcal{B}(s',a')\right)\\&+p_M(s'|s,a)\,\gamma\sum_{a'}\pi(a'|s')\,\epsilon_{\text{MDP}}(s',a').\end{split}$ (8)

This means extrapolation error is a function of divergence in the transition distributions, weighted by value, along with the error at succeeding states. If the policy is chosen carefully, the error between value functions can be minimized by visiting regions where the transition distributions are similar. For simplicity, we denote

$\epsilon^\pi_{\text{MDP}}=\sum_s\mu_\pi(s)\sum_a\pi(a|s)|\epsilon_{\text{MDP}}(s,a)|.$ (9)

To evaluate a policy $\pi$ exactly at relevant state-action pairs, only $\epsilon^\pi_{\text{MDP}}=0$ is required. We can then determine the condition required to evaluate the exact expected return of a policy without extrapolation error.

Lemma 1. For all reward functions, $\epsilon^\pi_{\text{MDP}}=0$ if and only if $p_\mathcal{B}(s'|s,a)=p_M(s'|s,a)$ for all $s'\in\mathcal{S}$ and $(s,a)$ such that $\mu_\pi(s)>0$ and $\pi(a|s)>0$.

Lemma 1 states that if $M_\mathcal{B}$ and $M$ exhibit the same transition probabilities in regions of relevance, the policy can be accurately evaluated. For a stochastic MDP this may require an infinite number of samples to converge to the true distribution; however, for a deterministic MDP this requires only a single transition. This means a policy which only traverses transitions contained in the batch can be evaluated without error. More formally, we denote a policy $\pi\in\Pi_\mathcal{B}$ as batch-constrained if for all $(s,a)$ where $\mu_\pi(s)>0$ and $\pi(a|s)>0$ then $(s,a)\in\mathcal{B}$. Additionally, we define a batch $\mathcal{B}$ as coherent if for all $(s,a,s')\in\mathcal{B}$ then $s'\in\mathcal{B}$ unless $s'$ is a terminal state. This condition is trivially satisfied if the data is collected in trajectories, or if all possible states are contained in the batch. With a coherent batch, we can guarantee the existence of a batch-constrained policy.

Theorem 2. For a deterministic MDP and all reward functions, $\epsilon^\pi_{\text{MDP}}=0$ if and only if the policy $\pi$ is batch-constrained. Furthermore, if $\mathcal{B}$ is coherent, then such a policy must exist if the start state $s_0\in\mathcal{B}$.

Batch-constrained policies can be used in conjunction with Q-learning to form batch-constrained Q-learning (BCQL), which follows the standard tabular Q-learning update while constraining the possible actions with respect to the batch:

$Q(s,a)\leftarrow(1-\alpha)Q(s,a)+\alpha\big(r+\gamma\max_{a'\text{ s.t. }(s',a')\in\mathcal{B}}Q(s',a')\big).$ (10)

BCQL converges under the same conditions as the standard form of Q-learning, noting the batch-constraint is nonrestrictive given infinite state-action visitation.
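A minimal tabular sketch of this update (illustrative only; it assumes hashable states and actions and a batch of `(s, a, r, s2, done)` tuples):

```python
import random
from collections import defaultdict

def bcql(batch, gamma=0.99, alpha=0.1, iterations=100_000):
    """Tabular batch-constrained Q-learning (Equation 10): the backup only
    maximizes over actions a' such that (s', a') appears in the batch."""
    Q = defaultdict(float)
    seen_actions = defaultdict(set)
    for (s, a, r, s2, done) in batch:
        seen_actions[s].add(a)
    for _ in range(iterations):
        s, a, r, s2, done = random.choice(batch)
        if done or not seen_actions[s2]:
            target = r  # no batch-constrained action available at s'
        else:
            target = r + gamma * max(Q[(s2, a2)] for a2 in seen_actions[s2])
        Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * target
    return Q
```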

Theorem 3. Given the Robbins-Monro stochastic convergence conditions on the learning rate $\alpha$, and standard sampling requirements from the environment, BCQL converges to the optimal value function $Q^*$.

The more interesting property of BCQL is that for a deterministic MDP and any coherent batch $\mathcal{B}$, BCQL converges to the optimal batch-constrained policy $\pi^*\in\Pi_\mathcal{B}$ such that $Q^{\pi^*}(s,a)\geq Q^\pi(s,a)$ for all $\pi\in\Pi_\mathcal{B}$ and $(s,a)\in\mathcal{B}$.

Theorem 4. Given a deterministic MDP and coherent batch $\mathcal{B}$, along with the Robbins-Monro stochastic convergence conditions on the learning rate $\alpha$ and standard sampling requirements on the batch $\mathcal{B}$, BCQL converges to $Q^\pi_\mathcal{B}(s,a)$ where $\pi^*(s)=\operatorname{argmax}_{a\text{ s.t. }(s,a)\in\mathcal{B}}Q^\pi_\mathcal{B}(s,a)$ is the optimal batch-constrained policy.

This means that BCQL is guaranteed to outperform any behavioral policy when starting from any state contained in the batch, effectively outperforming imitation learning. Unlike standard Q-learning, there is no condition on state-action visitation, other than coherency in the batch.

4.2 Batch-Constrained Deep Reinforcement Learning

We introduce our approach to off-policy batch reinforcement learning, Batch-Constrained deep Q-learning (BCQ). BCQ approaches the notion of batch-constrained through a generative model. For a given state, BCQ generates plausible candidate actions with high similarity to the batch, and then selects the highest valued action through a learned Q-network. Furthermore, we bias this value estimate to penalize rare, or unseen, states through a modification to Clipped Double Q-learning (Fujimoto et al., 2018). As a result, BCQ learns a policy with a similar state-action visitation to the data in the batch, as inspired by the theoretical benefits of its tabular counterpart.

To maintain the notion of batch-constraint, we define a similarity metric by making the assumption that for a given state $s$, the similarity between $(s,a)$ and the state-action pairs in the batch $\mathcal{B}$ can be modelled using a learned state-conditioned marginal likelihood $P_\mathcal{B}^G(a|s)$. In this case, it follows that the policy maximizing $P_\mathcal{B}^G(a|s)$ would minimize the error induced by extrapolation from distant, or unseen, state-action pairs, by only selecting the most likely actions in the batch with respect to a given state. Given the difficulty of estimating $P_\mathcal{B}^G(a|s)$ in high-dimensional continuous spaces, we instead train a parametric generative model of the batch, $G_\omega(s)$, which we can sample actions from, as a reasonable approximation to $\operatorname{argmax}_a P_\mathcal{B}^G(a|s)$.

For our generative model we use a conditional variational auto-encoder (VAE) (Kingma & Welling, 2013; Sohn et al., 2015), which models the distribution by transforming an underlying latent space (see the Supplementary Material for an introduction to VAEs). The generative model $G_\omega$, alongside the value function $Q_\theta$, can be used as a policy by sampling $n$ actions from $G_\omega$ and selecting the highest valued action according to the value estimate $Q_\theta$. To increase the diversity of seen actions, we introduce a perturbation model $\xi_\phi(s,a,\Phi)$, which outputs an adjustment to an action $a$ in the range $[-\Phi,\Phi]$. This enables access to actions in a constrained region, without having to sample from the generative model a prohibitive number of times. This results in the policy $\pi$:

$\pi(s)=\operatorname{argmax}_{a_i+\xi_\phi(s,a_i,\Phi)}Q_\theta(s,a_i+\xi_\phi(s,a_i,\Phi)),\quad\{a_i\sim G_\omega(s)\}_{i=1}^n.$ (11)

The choice of $n$ and $\Phi$ creates a trade-off between an imitation learning and a reinforcement learning algorithm. If $\Phi=0$ and the number of sampled actions $n=1$, then the policy resembles behavioral cloning, and as $\Phi\rightarrow a_{\text{max}}-a_{\text{min}}$ and $n\rightarrow\infty$, the algorithm approaches Q-learning, as the policy begins to greedily maximize the value function over the entire action space.
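A PyTorch-style sketch of this action-selection rule follows. It assumes a trained VAE decoder `vae.decode(s)` that samples actions from $G_\omega(s)$, a perturbation network `perturb(s, a)` whose output is already scaled to $[-\Phi,\Phi]$, and a critic `q1(s, a)`; these names are illustrative, not the released implementation.

```python
import torch

def bcq_select_action(s, vae, perturb, q1, n=10):
    """Sketch of Equation (11): sample n candidate actions, perturb them, and
    return the candidate with the highest estimated value."""
    with torch.no_grad():
        s_rep = s.repeat(n, 1)                                # s assumed shape [1, state_dim]
        candidates = vae.decode(s_rep)                        # a_i ~ G_w(s)
        candidates = candidates + perturb(s_rep, candidates)  # a_i + xi_phi(s, a_i, Phi)
        values = q1(s_rep, candidates)                        # Q_theta(s, .), shape [n, 1]
        best = values.argmax(dim=0)
        return candidates[best]
```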

Algorithm 1 BCQ

Input: Batch $\mathcal{B}$, horizon $T$, target network update rate $\tau$, mini-batch size $N$, max perturbation $\Phi$, number of sampled actions $n$, minimum weighting $\lambda$.
Initialize Q-networks $Q_{\theta_1},Q_{\theta_2}$, perturbation network $\xi_\phi$, and VAE $G_\omega=\{E_{\omega_1},D_{\omega_2}\}$, with random parameters $\theta_1$, $\theta_2$, $\phi$, $\omega$, and target networks $Q_{\theta'_1},Q_{\theta'_2}$, $\xi_{\phi'}$ with $\theta'_1\leftarrow\theta_1$, $\theta'_2\leftarrow\theta_2$, $\phi'\leftarrow\phi$.
for $t=1$ to $T$ do
    Sample mini-batch of $N$ transitions $(s,a,r,s')$ from $\mathcal{B}$
    $\mu,\sigma=E_{\omega_1}(s,a)$, $\tilde{a}=D_{\omega_2}(s,z)$, $z\sim\mathcal{N}(\mu,\sigma)$
    $\omega\leftarrow\operatorname{argmin}_\omega\sum(a-\tilde{a})^2+D_{\text{KL}}(\mathcal{N}(\mu,\sigma)\,\|\,\mathcal{N}(0,1))$
    Sample $n$ actions: $\{a_i\sim G_\omega(s')\}_{i=1}^n$
    Perturb each action: $\{a_i=a_i+\xi_\phi(s',a_i,\Phi)\}_{i=1}^n$
    Set value target $y$ (Eqn. 13)
    $\theta\leftarrow\operatorname{argmin}_\theta\sum(y-Q_\theta(s,a))^2$
    $\phi\leftarrow\operatorname{argmax}_\phi\sum Q_{\theta_1}(s,a+\xi_\phi(s,a,\Phi)),\ a\sim G_\omega(s)$
    Update target networks: $\theta'_i\leftarrow\tau\theta+(1-\tau)\theta'_i$
    $\phi'\leftarrow\tau\phi+(1-\tau)\phi'$
end for

The perturbation model $\xi_\phi$ can be trained to maximize $Q_\theta(s,a)$ through the deterministic policy gradient algorithm (Silver et al., 2014) by sampling $a\sim G_\omega(s)$:

$\phi\leftarrow\operatorname{argmax}_\phi\sum_{(s,a)\in\mathcal{B}}Q_\theta(s,a+\xi_\phi(s,a,\Phi)).$ (12)
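A corresponding sketch of this update, under the same hypothetical interfaces as the action-selection sketch above (only the parameters of $\xi_\phi$ are registered with `perturb_optimizer`):

```python
def bcq_perturbation_update(s, vae, perturb, q1, perturb_optimizer):
    """Sketch of Equation (12): train xi_phi to maximize Q_theta1 over
    actions sampled from the generative model."""
    a = vae.decode(s)
    loss = -q1(s, a + perturb(s, a)).mean()
    perturb_optimizer.zero_grad()
    loss.backward()
    perturb_optimizer.step()
```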

To penalize uncertainty over future states, we modify Clipped Double Q-learning (Fujimoto et al., 2018), which estimates the value by taking the minimum between two Q-networks $\{Q_{\theta_1},Q_{\theta_2}\}$. Although originally used as a countermeasure to overestimation bias (Thrun & Schwartz, 1993; Van Hasselt, 2010), the minimum operator also penalizes high-variance estimates in regions of uncertainty, and pushes the policy to favor actions which lead to states contained in the batch. In particular, we take a convex combination of the two values, with a higher weight on the minimum, to form a learning target which is used by both Q-networks:

$r+\gamma\max_{a_i}\left[\lambda\min_{j=1,2}Q_{\theta'_j}(s',a_i)+(1-\lambda)\max_{j=1,2}Q_{\theta'_j}(s',a_i)\right]$ (13)

where $a_i$ corresponds to the perturbed actions, sampled from the generative model. If we set $\lambda=1$, this update corresponds to Clipped Double Q-learning. We use this weighted minimum because the constrained update produces less overestimation bias than a purely greedy policy update, and it enables control over how heavily uncertainty at future time steps is penalized through the choice of $\lambda$.
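A sketch of this target computation in the same hypothetical PyTorch style (`q1_targ`, `q2_targ`, and `perturb_targ` denote target networks; `lam` corresponds to $\lambda$ and its value here is an assumed setting):

```python
import torch

def bcq_value_target(r, s2, done, vae, perturb_targ, q1_targ, q2_targ,
                     n=10, gamma=0.99, lam=0.75):
    """Sketch of Equation (13): a weighted min/max over the two target
    Q-networks, maximized over n perturbed candidate actions."""
    with torch.no_grad():
        batch_size = s2.shape[0]
        s2_rep = s2.repeat_interleave(n, dim=0)      # each next state repeated n times
        a2 = vae.decode(s2_rep)                      # a_i ~ G_w(s')
        a2 = a2 + perturb_targ(s2_rep, a2)           # perturbed candidates
        q1, q2 = q1_targ(s2_rep, a2), q2_targ(s2_rep, a2)
        q = lam * torch.min(q1, q2) + (1 - lam) * torch.max(q1, q2)
        q = q.reshape(batch_size, n).max(dim=1, keepdim=True).values  # max over a_i
        return r + gamma * (1 - done) * q
```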

This forms Batch-Constrained deep Q-learning (BCQ), which maintains four parametrized networks: a generative model $G_\omega(s)$, a perturbation model $\xi_\phi(s,a)$, and two Q-networks $Q_{\theta_1}(s,a)$, $Q_{\theta_2}(s,a)$. We summarize BCQ in Algorithm 1. In the following section, we demonstrate that BCQ results in stable value learning and a strong performance in the batch setting. Furthermore, we find that only a single choice of hyper-parameters is necessary for a wide range of tasks and environments.

5 Experiments

[Figure 2; panels: (a) final buffer performance, (b) concurrent performance, (c) imitation performance, (d) imperfect demonstrations performance.]
Figure 2: We evaluate BCQ and several baselines on the experiments from Section 3.1, as well as the imperfect demonstrations task. The shaded area represents half a standard deviation. The bold black line measures the average return of episodes contained in the batch. Only BCQ matches or outperforms the performance of the behavioral policy in all tasks.

To evaluate the effectiveness of Batch-Constrained deep Q-learning (BCQ) in a high-dimensional setting, we focus on MuJoCo environments in OpenAI gym (Todorov et al., 2012; Brockman et al., 2016). For reproducibility, we make no modifications to the original environments or reward functions. We compare our method with DDPG (Lillicrap et al., 2015), DQN (Mnih et al., 2015) using an independently discretized action space, a feed-forward behavioral cloning method (BC), and a variant with a VAE (VAE-BC), using $G_\omega(s)$ from BCQ. Exact implementation and experimental details are provided in the Supplementary Material.

[Figure 3; panels: (a) final buffer, (b) concurrent, (c) imitation.]
Figure 3: We examine the value estimates of BCQ, along with DDPG and DQN, on the experiments from Section 3.1 in the Hopper-v1 environment. Each individual trial is plotted, with the mean in bold. An estimate of the true value of BCQ, evaluated by Monte Carlo returns, is marked by a dotted line. Unlike the state of the art baselines, BCQ exhibits a highly stable value function in each task. Graphs for the other environments and the imperfect demonstrations task can be found in the Supplementary Material.

We evaluate each method following the three experiments defined in Section 3.1. In final buffer the off-policy agents learn from the final replay buffer gathered by training a DDPG agent over a million time steps. In concurrent the off-policy agents learn concurrently, with the same replay buffer, as the behavioral DDPG policy, and in imitation, the agents learn from a dataset collected by an expert policy. Additionally, to study the robustness of BCQ to noisy and multi-modal data, we include an imperfect demonstrations task, in which the agents are trained with a batch of 100k transitions collected by an expert policy, with two sources of noise. The behavioral policy selects actions randomly with probability 0.3 and with high exploratory noise $\mathcal{N}(0,0.3)$ added to the remaining actions. The experimental results for these tasks are reported in Figure 2. Furthermore, the estimated values of BCQ, DDPG and DQN, and the true value of BCQ are displayed in Figure 3.
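A small sketch of this noisy behavioral policy (illustrative only; `expert_policy` and `action_space` are assumed placeholders following the gym interface):

```python
import numpy as np

def noisy_expert_action(expert_policy, s, action_space, p_random=0.3, noise_std=0.3):
    """Imperfect demonstrations: with probability 0.3 act uniformly at random,
    otherwise add N(0, 0.3) noise to the expert action."""
    if np.random.rand() < p_random:
        return action_space.sample()
    a = expert_policy(s) + np.random.normal(0, noise_std, size=action_space.shape)
    return np.clip(a, action_space.low, action_space.high)
```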

Our approach, BCQ, is the only algorithm which succeeds at all tasks, matching or outperforming the behavioral policy in each instance, and outperforming all other agents, besides in the imitation learning task where behavioral cloning unsurprisingly performs the best. These results demonstrate that our algorithm can be used as a single approach for both imitation learning and off-policy reinforcement learning, with a single set of fixed hyper-parameters. Furthermore, unlike the deep reinforcement learning algorithms, DDPG and DQN, BCQ exhibits a highly stable value function in the presence of off-policy samples, suggesting extrapolation error has been successfully mitigated through the batch-constraint. In the imperfect demonstrations task, we find that both deep reinforcement learning and imitation learning algorithms perform poorly. BCQ, however, is able to strongly outperform the noisy demonstrator, disentangling poor and expert actions. Furthermore, compared to current deep reinforcement learning algorithms, which can require millions of time steps (Duan et al., 2016; Henderson et al., 2017), BCQ attains a high performance in remarkably few iterations. This suggests our approach effectively leverages expert transitions, even in the presence of noise.

6 Related Work

Batch Reinforcement Learning. While batch reinforcement learning algorithms have been shown to be convergent with non-parametric function approximators such as averagers (Gordon, 1995) and kernel methods (Ormoneit & Sen, 2002), they make no guarantees on the quality of the policy without infinite data. Other batch algorithms, such as fitted Q-iteration, have used other function approximators, including decision trees (Ernst et al., 2005) and neural networks (Riedmiller, 2005), but come without convergence guarantees. Unlike many previous approaches to off-policy policy evaluation (Peshkin & Shelton, 2002; Thomas et al., 2015; Liu et al., 2018), our work focuses on constraining the policy to a subset of policies which can be adequately evaluated, rather than the process of evaluation itself. Additionally, off-policy algorithms which rely on importance sampling (Precup et al., 2001; Jiang & Li, 2016; Munos et al., 2016) may not be applicable in a batch setting, requiring access to the action probabilities under the behavioral policy, and scale poorly to multi-dimensional action spaces. Reinforcement learning with a replay buffer (Lin, 1992) can be considered a form of batch reinforcement learning, and is a standard tool for off-policy deep reinforcement learning algorithms (Mnih et al., 2015). It has been observed that a large replay buffer can be detrimental to performance (de Bruin et al., 2015; Zhang & Sutton, 2017) and the diversity of states in the buffer is an important factor for performance (de Bruin et al., 2016). Isele & Cosgun (2018) observed the performance of an agent was strongest when the distribution of data in the replay buffer matched the test distribution. These results defend the notion that extrapolation error is an important factor in the performance of off-policy reinforcement learning.

Imitation Learning. Imitation learning and its variants are well studied problems (Schaal, 1999; Argall et al., 2009; Hussein et al., 2017). Imitation has been combined with reinforcement learning, via learning from demonstrations methods (Kim et al., 2013; Piot et al., 2014; Chemali & Lazaric, 2015), with deep reinforcement learning extensions (Hester et al., 2017; Večerík et al., 2017), and modified policy gradient approaches (Ho et al., 2016; Sun et al., 2017; Cheng et al., 2018; Sun et al., 2018). While effective, these interactive methods are inadequate for batch reinforcement learning as they require either an explicit distinction between expert and non-expert data, further on-policy data collection or access to an oracle. Research in imitation, and inverse reinforcement learning, with robustness to noise is an emerging area (Evans, 2016; Nair et al., 2018), but relies on some form of expert data. Gao et al. (2018) introduced an imitation learning algorithm which learned from imperfect demonstrations, by favoring seen actions, but is limited to discrete actions. Our work also connects to residual policy learning (Johannink et al., 2018; Silver et al., 2018), where the initial policy is the generative model, rather than an expert or feedback controller.

Uncertainty in Reinforcement Learning. Uncertainty estimates in deep reinforcement learning have generally been used to encourage exploration (Dearden et al., 1998; Strehl & Littman, 2008; O’Donoghue et al., 2018; Azizzadenesheli et al., 2018). Other methods have examined approximating the Bayesian posterior of the value function (Osband et al., 2016, 2018; Touati et al., 2018), again using the variance to encourage exploration to unseen regions of the state space. In model-based reinforcement learning, uncertainty has been used for exploration, but also for the opposite effect–to push the policy towards regions of certainty in the model. This is used to combat the well-known problems with compounding model errors, and is present in policy search methods (Deisenroth & Rasmussen, 2011; Gal et al., 2016; Higuera et al., 2018; Xu et al., 2018), or combined with trajectory optimization (Chua et al., 2018) or value-based methods (Buckman et al., 2018). Our work connects to policy methods with conservative updates (Kakade & Langford, 2002), such as trust region (Schulman et al., 2015; Achiam et al., 2017; Pham et al., 2018) and information-theoretic methods (Peters & Mülling, 2010; Van Hoof et al., 2017), which aim to keep the updated policy similar to the previous policy. These methods avoid explicit uncertainty estimates, and rather force policy updates into a constrained range before collecting new data, limiting errors introduced by large changes in the policy. Similarly, our approach can be thought of as an off-policy variant, where the policy aims to be kept close, in output space, to any combination of the previous policies which performed data collection.

7 Conclusion

In this work, we demonstrate a critical problem in off-policy reinforcement learning with finite data, where the value target introduces error by including an estimate of unseen state-action pairs. This phenomenon, which we denote extrapolation error, has important implications for off-policy and batch reinforcement learning, as it is generally implausible to have complete state-action coverage in any practical setting. We present batch-constrained reinforcement learning–acting close to on-policy with respect to the available data, as an answer to extrapolation error. When extended to a deep reinforcement learning setting, our algorithm, Batch-Constrained deep Q-learning (BCQ), is the first continuous control algorithm capable of learning from arbitrary batch data, without exploration. Due to the importance of batch reinforcement learning for practical applications, we believe BCQ will be a strong foothold for future algorithms to build on, while furthering our understanding of the systematic risks in Q-learning (Thrun & Schwartz, 1993; Lu et al., 2018).

References

  • Achiam, J., Held, D., Tamar, A., and Abbeel, P. Constrained policy optimization. In International Conference on Machine Learning, pp. 22–31, 2017.
  • Argall, B. D., Chernova, S., Veloso, M., and Browning, B. A survey of robot learning from demonstration. Robotics and Autonomous Systems, 57(5):469–483, 2009.
  • Azizzadenesheli, K., Brunskill, E., and Anandkumar, A. Efficient exploration through bayesian deep q-networks. arXiv preprint arXiv:1802.04412, 2018.
  • Bertsekas, D. P. and Tsitsiklis, J. N. Neuro-Dynamic Programming. Athena Scientific, Belmont, MA, 1996.
  • Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., and Zaremba, W. OpenAI Gym, 2016.
  • Buckman, J., Hafner, D., Tucker, G., Brevdo, E., and Lee, H. Sample-efficient reinforcement learning with stochastic ensemble value expansion. In Advances in Neural Information Processing Systems, pp. 8234–8244, 2018.
  • Chemali, J. and Lazaric, A. Direct policy iteration with demonstrations. In Proceedings of the Twenty-Fourth International Joint Conference on Artificial Intelligence, 2015.
  • Cheng, C.-A., Yan, X., Wagener, N., and Boots, B. Fast policy learning through imitation and reinforcement. arXiv preprint arXiv:1805.10413, 2018.
  • Chua, K., Calandra, R., McAllister, R., and Levine, S. Deep reinforcement learning in a handful of trials using probabilistic dynamics models. In Advances in Neural Information Processing Systems 31, pp. 4759–4770, 2018.
  • Dayan, P. and Watkins, C. J. C. H. Q-learning. Machine Learning, 8(3):279–292, 1992.
  • de Bruin, T., Kober, J., Tuyls, K., and Babuška, R. The importance of experience replay database composition in deep reinforcement learning. In Deep Reinforcement Learning Workshop, NIPS, 2015.
  • de Bruin, T., Kober, J., Tuyls, K., and Babuška, R. Improved deep reinforcement learning for robotics through distribution-based experience retention. In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 3947–3952. IEEE, 2016.
  • Dearden, R., Friedman, N., and Russell, S. Bayesian q-learning. In AAAI/IAAI, pp. 761–768, 1998.
  • Deisenroth, M. and Rasmussen, C. E. Pilco: A model-based and data-efficient approach to policy search. In International Conference on Machine Learning, pp. 465–472, 2011.
  • Duan, Y., Chen, X., Houthooft, R., Schulman, J., and Abbeel, P. Benchmarking deep reinforcement learning for continuous control. In International Conference on Machine Learning, pp. 1329–1338, 2016.
  • Ernst, D., Geurts, P., and Wehenkel, L. Tree-based batch mode reinforcement learning. Journal of Machine Learning Research, 6(Apr):503–556, 2005.
  • Evans, O. Learning the preferences of ignorant, inconsistent agents. In AAAI, pp. 323–329, 2016.
  • Fujimoto, S., van Hoof, H., and Meger, D. Addressing function approximation error in actor-critic methods. In International Conference on Machine Learning, volume 80, pp. 1587–1596. PMLR, 2018.
  • Gal, Y., McAllister, R., and Rasmussen, C. E. Improving pilco with bayesian neural network dynamics models. In Data-Efficient Machine Learning Workshop, International Conference on Machine Learning, 2016.
  • Gao, Y., Lin, J., Yu, F., Levine, S., and Darrell, T. Reinforcement learning from imperfect demonstrations. arXiv preprint arXiv:1802.05313, 2018.
  • Goodfellow, I. J., Mirza, M., Xiao, D., Courville, A., and Bengio, Y. An empirical investigation of catastrophic forgetting in gradient-based neural networks. arXiv preprint arXiv:1312.6211, 2013.
  • Gordon, G. J. Stable function approximation in dynamic programming. In Machine Learning Proceedings 1995, pp. 261–268. Elsevier, 1995.
  • Henderson, P., Islam, R., Bachman, P., Pineau, J., Precup, D., and Meger, D. Deep reinforcement learning that matters. arXiv preprint arXiv:1709.06560, 2017.
  • Hester, T., Vecerik, M., Pietquin, O., Lanctot, M., Schaul, T., Piot, B., Horgan, D., Quan, J., Sendonaris, A., Dulac-Arnold, G., et al. Deep q-learning from demonstrations. arXiv preprint arXiv:1704.03732, 2017.
  • Higuera, J. C. G., Meger, D., and Dudek, G. Synthesizing neural network controllers with probabilistic model based reinforcement learning. arXiv preprint arXiv:1803.02291, 2018.
  • Ho, J., Gupta, J., and Ermon, S. Model-free imitation learning with policy optimization. In International Conference on Machine Learning, pp. 2760–2769, 2016.
  • Hussein, A., Gaber, M. M., Elyan, E., and Jayne, C. Imitation learning: A survey of learning methods. ACM Computing Surveys (CSUR), 50(2):21, 2017.
  • Isele, D. and Cosgun, A. Selective experience replay for lifelong learning. arXiv preprint arXiv:1802.10269, 2018.
  • Jaksch, T., Ortner, R., and Auer, P. Near-optimal regret bounds for reinforcement learning. Journal of Machine Learning Research, 11(Apr):1563–1600, 2010.
  • Jiang, N. and Li, L. Doubly robust off-policy value evaluation for reinforcement learning. In International Conference on Machine Learning, pp. 652–661, 2016.
  • Johannink, T., Bahl, S., Nair, A., Luo, J., Kumar, A., Loskyll, M., Ojea, J. A., Solowjow, E., and Levine, S. Residual reinforcement learning for robot control. arXiv preprint arXiv:1812.03201, 2018.
  • Kakade, S. and Langford, J. Approximately optimal approximate reinforcement learning. In International Conference on Machine Learning, volume 2, pp. 267–274, 2002.
  • Kim, B., Farahmand, A.-m., Pineau, J., and Precup, D. Learning from limited demonstrations. In Advances in Neural Information Processing Systems, pp. 2859–2867, 2013.
  • Kingma, D. and Ba, J. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
  • Kingma, D. P. and Welling, M. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013.
  • Konda, V. R. and Tsitsiklis, J. N. On actor-critic algorithms. SIAM Journal on Control and Optimization, 42(4):1143–1166, 2003.
  • Lai, T. L. and Robbins, H. Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics, 6(1):4–22, 1985.
  • Lange, S., Gabel, T., and Riedmiller, M. Batch reinforcement learning. In Reinforcement Learning, pp. 45–73. Springer, 2012.
  • Lillicrap, T. P., Hunt, J. J., Pritzel, A., Heess, N., Erez, T., Tassa, Y., Silver, D., and Wierstra, D. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971, 2015.
  • Lin, L.-J. Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine Learning, 8(3-4):293–321, 1992.
  • Liu, Y., Gottesman, O., Raghu, A., Komorowski, M., Faisal, A. A., Doshi-Velez, F., and Brunskill, E. Representation balancing mdps for off-policy policy evaluation. In Advances in Neural Information Processing Systems, pp. 2644–2653, 2018.
  • Lu, T., Schuurmans, D., and Boutilier, C. Non-delusional q-learning and value-iteration. In Advances in Neural Information Processing Systems, pp. 9971–9981, 2018.
  • McCloskey, M. and Cohen, N. J. Catastrophic interference in connectionist networks: The sequential learning problem. In Psychology of Learning and Motivation, volume 24, pp. 109–165. Elsevier, 1989.
  • Melo, F. S. Convergence of q-learning: A simple proof. Institute of Systems and Robotics, Tech. Rep., pp. 1–4, 2001.
  • Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, A., Riedmiller, M., Fidjeland, A. K., Ostrovski, G., et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.
  • Munos, R., Stepleton, T., Harutyunyan, A., and Bellemare, M. Safe and efficient off-policy reinforcement learning. In Advances in Neural Information Processing Systems, pp. 1054–1062, 2016.
  • Nair, A., McGrew, B., Andrychowicz, M., Zaremba, W., and Abbeel, P. Overcoming exploration in reinforcement learning with demonstrations. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pp. 6292–6299. IEEE, 2018.
  • O’Donoghue, B., Osband, I., Munos, R., and Mnih, V. The uncertainty Bellman equation and exploration. In International Conference on Machine Learning, volume 80, pp. 3839–3848. PMLR, 2018.
  • Ormoneit, D. and Sen, Ś. Kernel-based reinforcement learning. Machine Learning, 49(2-3):161–178, 2002.
  • Osband, I., Blundell, C., Pritzel, A., and Van Roy, B. Deep exploration via bootstrapped dqn. In Advances in Neural Information Processing Systems, pp. 4026–4034, 2016.
  • Osband, I., Aslanides, J., and Cassirer, A. Randomized prior functions for deep reinforcement learning. In Advances in Neural Information Processing Systems 31, pp. 8626–8638, 2018.
  • Peshkin, L. and Shelton, C. R. Learning from scarce experience. In International Conference on Machine Learning, pp. 498–505, 2002.
  • Peters, J. and Mülling, K. Relative entropy policy search. In AAAI, pp. 1607–1612, 2010.
  • Pham, T.-H., De Magistris, G., Agravante, D. J., Chaudhury, S., Munawar, A., and Tachibana, R. Constrained exploration and recovery from experience shaping. arXiv preprint arXiv:1809.08925, 2018.
  • Piot, B., Geist, M., and Pietquin, O. Boosted bellman residual minimization handling expert demonstrations. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pp. 549–564. Springer, 2014.
  • Precup, D., Sutton, R. S., and Dasgupta, S. Off-policy temporal-difference learning with function approximation. In International Conference on Machine Learning, pp. 417–424, 2001.
  • Rezende, D. J., Mohamed, S., and Wierstra, D. Stochastic backpropagation and approximate inference in deep generative models. arXiv preprint arXiv:1401.4082, 2014.
  • Riedmiller, M. Neural fitted q iteration – first experiences with a data efficient neural reinforcement learning method. In European Conference on Machine Learning, pp. 317–328. Springer, 2005.
  • Schaal, S. Is imitation learning the route to humanoid robots? Trends in Cognitive Sciences, 3(6):233–242, 1999.
  • Schulman, J., Levine, S., Abbeel, P., Jordan, M., and Moritz, P. Trust region policy optimization. In International Conference on Machine Learning, pp. 1889–1897, 2015.
  • Silver, D., Lever, G., Heess, N., Degris, T., Wierstra, D., and Riedmiller, M. Deterministic policy gradient algorithms. In International Conference on Machine Learning, pp. 387–395, 2014.
  • Silver, T., Allen, K., Tenenbaum, J., and Kaelbling, L. Residual policy learning. arXiv preprint arXiv:1812.06298, 2018.
  • Singh, S., Jaakkola, T., Littman, M. L., and Szepesvári, C. Convergence results for single-step on-policy reinforcement-learning algorithms. Machine Learning, 38(3):287–308, 2000.
  • Sohn, K., Lee, H., and Yan, X. Learning structured output representation using deep conditional generative models. In Advances in Neural Information Processing Systems, pp. 3483–3491, 2015.
  • Strehl, A. L. and Littman, M. L. An analysis of model-based interval estimation for markov decision processes. Journal of Computer and System Sciences, 74(8):1309–1331, 2008.
  • Sun, W., Venkatraman, A., Gordon, G. J., Boots, B., and Bagnell, J. A. Deeply aggrevated: Differentiable imitation learning for sequential prediction. In International Conference on Machine Learning, pp. 3309–3318, 2017.
  • Sun, W., Bagnell, J. A., and Boots, B. Truncated horizon policy search: Combining reinforcement learning & imitation learning. arXiv preprint arXiv:1805.11240, 2018.
  • Sutton, R. S. Learning to predict by the methods of temporal differences. Machine Learning, 3(1):9–44, 1988.
  • Sutton, R. S. and Barto, A. G. Reinforcement Learning: An Introduction, volume 1. MIT Press, Cambridge, 1998.
  • Thomas, P., Theocharous, G., and Ghavamzadeh, M. High confidence policy improvement. In International Conference on Machine Learning, pp. 2380–2388, 2015.
  • Thrun, S. and Schwartz, A. Issues in using function approximation for reinforcement learning. In Proceedings of the 1993 Connectionist Models Summer School, Hillsdale, NJ. Lawrence Erlbaum, 1993.
  • Todorov, E., Erez, T., and Tassa, Y. Mujoco: A physics engine for model-based control. In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 5026–5033. IEEE, 2012.
  • Touati, A., Satija, H., Romoff, J., Pineau, J., and Vincent, P. Randomized value functions via multiplicative normalizing flows. arXiv preprint arXiv:1806.02315, 2018.
  • Van Hasselt, H. Double q-learning. In Advances in Neural Information Processing Systems, pp. 2613–2621, 2010.
  • Van Hasselt, H., Guez, A., and Silver, D. Deep reinforcement learning with double q-learning. In AAAI, pp. 2094–2100, 2016.
  • Van Hoof, H., Neumann, G., and Peters, J. Non-parametric policy search with limited information loss. The Journal of Machine Learning Research, 18(1):2472–2517, 2017.
  • Večerík, M., Hester, T., Scholz, J., Wang, F., Pietquin, O., Piot, B., Heess, N., Rothörl, T., Lampe, T., and Riedmiller, M. Leveraging demonstrations for deep reinforcement learning on robotics problems with sparse rewards. arXiv preprint arXiv:1707.08817, 2017.
  • Watkins, C. J. C. H. Learning from delayed rewards. PhD thesis, King’s College, Cambridge, 1989.
  • Xu, H., Li, Y., Tian, Y., Darrell, T., and Ma, T. Algorithmic framework for model-based reinforcement learning with theoretical guarantees. arXiv preprint arXiv:1807.03858, 2018.
  • Zhang, S. and Sutton, R. S. A deeper look at experience replay. arXiv preprint arXiv:1712.01275, 2017.

Appendix A Missing Proofs

A.1 Proofs and Details from Section 4.1

Definition 1. We define a coherent batch $\mathcal{B}$ as a batch such that if $(s,a,s') \in \mathcal{B}$ then $s' \in \mathcal{B}$, unless $s'$ is a terminal state.

Definition 2. We define $\epsilon_{\text{MDP}}(s,a) = Q^{\pi}(s,a) - Q^{\pi}_{\mathcal{B}}(s,a)$ as the error between the true value of a policy $\pi$ in the MDP $M$ and the value of $\pi$ when learned with a batch $\mathcal{B}$.

Definition 3. For simplicity in notation, we denote

$\epsilon^{\pi}_{\text{MDP}} = \sum_{s} \mu_{\pi}(s) \sum_{a} \pi(a|s) \, |\epsilon_{\text{MDP}}(s,a)|. \qquad (14)$

To evaluate a policy $\pi$ exactly at relevant state-action pairs, only $\epsilon^{\pi}_{\text{MDP}} = 0$ is required.
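For concreteness, this weighted error can be computed in the tabular case as in the following sketch; it is an illustrative computation only, and the dictionary-based inputs mu_pi, pi and eps_mdp are assumed representations rather than notation from the paper.

def weighted_extrapolation_error(mu_pi, pi, eps_mdp):
    # eps^pi_MDP = sum_s mu_pi(s) sum_a pi(a|s) |eps_MDP(s, a)|,
    # with mu_pi[s] the state visitation weight, pi[s][a] the action
    # probability, and eps_mdp[(s, a)] the tabular error term.
    return sum(mu_pi[s] * sum(p_a * abs(eps_mdp[(s, a)]) for a, p_a in pi[s].items())
               for s in mu_pi)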

Definition 4. We define the optimal batch-constrained policy $\pi^{*} \in \Pi_{\mathcal{B}}$ such that $Q^{\pi^{*}}(s,a) \geq Q^{\pi}(s,a)$ for all $\pi \in \Pi_{\mathcal{B}}$ and $(s,a) \in \mathcal{B}$.


Algorithm 1. Batch-Constrained Q-learning (BCQL) maintains a tabular value function $Q(s,a)$ for each possible state-action pair $(s,a)$. A transition tuple $(s,a,r,s')$ is sampled from the batch $\mathcal{B}$ with uniform probability and the following update rule is applied, with learning rate $\alpha$:

$Q(s,a) \leftarrow (1-\alpha) Q(s,a) + \alpha \left( r + \gamma \max_{a' \text{ s.t. } (s',a') \in \mathcal{B}} Q(s',a') \right). \qquad (15)$
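A minimal tabular sketch of this update rule follows; it is an illustrative implementation of Equation (15) under stated assumptions, not the authors' code. Q is assumed to be a dictionary keyed by state-action pairs and batch_actions[s'] to hold the actions a' with (s', a') in the batch.

import random

def bcql_update(Q, batch, batch_actions, alpha=0.1, gamma=0.99):
    # Sample a transition (s, a, r, s') uniformly from the batch and apply
    # Q(s,a) <- (1 - alpha) Q(s,a) + alpha (r + gamma max_{a' s.t. (s',a') in B} Q(s',a')).
    s, a, r, s_next = random.choice(batch)
    next_actions = batch_actions.get(s_next, [])
    if next_actions:
        target = r + gamma * max(Q.get((s_next, a2), 0.0) for a2 in next_actions)
    else:
        # No (s', a') pair appears in the batch; falling back to the immediate
        # reward is a simplification of the terminal-state construction used in
        # the proof of Theorem 1 below.
        target = r
    Q[(s, a)] = (1 - alpha) * Q.get((s, a), 0.0) + alpha * target
    return Q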

Theorem 1. Performing Q-learning by sampling from a batch $\mathcal{B}$ converges to the optimal value function under the MDP $M_{\mathcal{B}}$.

Proof. Again, the MDP $M_{\mathcal{B}}$ is defined by the same action and state space as $M$, with an additional terminal state $s_{\text{init}}$. $M_{\mathcal{B}}$ has transition probabilities $p_{\mathcal{B}}(s'|s,a) = \frac{N(s,a,s')}{\sum_{\tilde{s}} N(s,a,\tilde{s})}$, where $N(s,a,s')$ is the number of times the tuple $(s,a,s')$ is observed in $\mathcal{B}$. If $\sum_{\tilde{s}} N(s,a,\tilde{s}) = 0$, then $p_{\mathcal{B}}(s_{\text{init}}|s,a) = 1$, where $r(s_{\text{init}},s,a)$ is set to the initialized value of $Q(s,a)$.

For any given MDP, Q-learning converges to the optimal value function given infinite state-action visitation and some standard assumptions (see Section A.2). Now note that sampling under a batch $\mathcal{B}$ with uniform probability satisfies the infinite state-action visitation assumptions of the MDP $M_{\mathcal{B}}$, where given $(s,a)$, the probability of sampling $(s,a,s')$ corresponds to $p(s'|s,a) = \frac{N(s,a,s')}{\sum_{\tilde{s}} N(s,a,\tilde{s})}$ in the limit. We remark that for $(s,a) \notin \mathcal{B}$, $Q(s,a)$ will never be updated, and will return the initialized value, which corresponds to the terminal transition $s_{\text{init}}$. It follows that sampling from $\mathcal{B}$ is equivalent to sampling from the MDP $M_{\mathcal{B}}$, and Q-learning converges to the optimal value function under $M_{\mathcal{B}}$.
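As an illustration of the empirical MDP used in this proof, the following sketch (an assumed implementation, not taken from the paper) recovers $p_{\mathcal{B}}(s'|s,a)$ from transition counts; state-action pairs absent from the batch simply have no entry, which plays the role of the transition to the terminal state $s_{\text{init}}$.

from collections import defaultdict

def empirical_transitions(batch):
    # Estimate p_B(s'|s,a) = N(s,a,s') / sum_s~ N(s,a,s~) from (s, a, r, s') tuples.
    counts = defaultdict(lambda: defaultdict(int))
    for s, a, _, s_next in batch:
        counts[(s, a)][s_next] += 1
    p_B = {}
    for (s, a), next_counts in counts.items():
        total = sum(next_counts.values())
        p_B[(s, a)] = {s_next: n / total for s_next, n in next_counts.items()}
    return p_B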


Remark 1. For any policy $\pi$ and state-action pair $(s,a)$, the error term $\epsilon_{\text{MDP}}(s,a)$ satisfies the following Bellman-like equation:

$\begin{split} \epsilon_{\text{MDP}}(s,a) = {} & \sum_{s'} \left( p_M(s'|s,a) - p_{\mathcal{B}}(s'|s,a) \right) \left( r(s,a,s') + \gamma \sum_{a'} \pi(a'|s') Q^{\pi}_{\mathcal{B}}(s',a') \right) \\ & + p_M(s'|s,a) \, \gamma \sum_{a'} \pi(a'|s') \, \epsilon_{\text{MDP}}(s',a'). \end{split} \qquad (16)$

Proof. The proof follows by expanding each $Q$, rearranging terms, and then simplifying the expression.

$\begin{split}
\epsilon_{\text{MDP}}(s,a) = {} & Q^{\pi}(s,a) - Q^{\pi}_{\mathcal{B}}(s,a) \\
= {} & \sum_{s'} p_M(s'|s,a) \left( r(s,a,s') + \gamma \sum_{a'} \pi(a'|s') Q^{\pi}(s',a') \right) - Q^{\pi}_{\mathcal{B}}(s,a) \\
= {} & \sum_{s'} p_M(s'|s,a) \left( r(s,a,s') + \gamma \sum_{a'} \pi(a'|s') Q^{\pi}(s',a') \right) \\
& - \sum_{s'} p_{\mathcal{B}}(s'|s,a) \left( r(s,a,s') + \gamma \sum_{a'} \pi(a'|s') Q^{\pi}_{\mathcal{B}}(s',a') \right) \\
= {} & \sum_{s'} \left( p_M(s'|s,a) - p_{\mathcal{B}}(s'|s,a) \right) r(s,a,s') + p_M(s'|s,a) \gamma \sum_{a'} \pi(a'|s') \left( Q^{\pi}_{\mathcal{B}}(s',a') + \epsilon_{\text{MDP}}(s',a') \right) \\
& - p_{\mathcal{B}}(s'|s,a) \gamma \sum_{a'} \pi(a'|s') Q^{\pi}_{\mathcal{B}}(s',a') \\
= {} & \sum_{s'} \left( p_M(s'|s,a) - p_{\mathcal{B}}(s'|s,a) \right) \left( r(s,a,s') + \gamma \sum_{a'} \pi(a'|s') Q^{\pi}_{\mathcal{B}}(s',a') \right) \\
& + p_M(s'|s,a) \gamma \sum_{a'} \pi(a'|s') \epsilon_{\text{MDP}}(s',a')
\end{split} \qquad (17)$

Lemma 1. For all reward functions, $\epsilon^{\pi}_{\text{MDP}} = 0$ if and only if $p_{\mathcal{B}}(s'|s,a) = p_M(s'|s,a)$ for all $s' \in \mathcal{S}$ and $(s,a)$ such that $\mu_{\pi}(s) > 0$ and $\pi(a|s) > 0$.

Proof. From Remark 1, we note the form of $\epsilon_{\text{MDP}}(s,a)$. Since no assumptions can be made on the reward function, and therefore on the expression $r(s,a,s') + \gamma \sum_{a'} \pi(a'|s') Q^{\pi}_{\mathcal{B}}(s',a')$, we have that $\epsilon_{\text{MDP}}(s,a) = 0$ if and only if $p_{\mathcal{B}}(s'|s,a) = p_M(s'|s,a)$ for all $s' \in \mathcal{S}$ and $p_M(s'|s,a) \gamma \sum_{a'} \pi(a'|s') \epsilon_{\text{MDP}}(s',a') = 0$.

($\Rightarrow$) Now we note that if $\epsilon_{\text{MDP}}(s,a) = 0$ then $p_M(s'|s,a) \gamma \sum_{a'} \pi(a'|s') \epsilon_{\text{MDP}}(s',a') = 0$, by the relationship defined by Remark 1 and the condition on the reward function. It follows that we must have $p_{\mathcal{B}}(s'|s,a) = p_M(s'|s,a)$ for all $s' \in \mathcal{S}$.

($\Leftarrow$) If we have $\sum_{s'} |p_M(s'|s,a) - p_{\mathcal{B}}(s'|s,a)| = 0$ for all $(s,a)$ such that $\mu_{\pi}(s) > 0$ and $\pi(a|s) > 0$, then for any $(s,a)$ under the given conditions, we have $\epsilon(s,a) = \sum_{s'} p_M(s'|s,a) \gamma \sum_{a'} \pi(a'|s') \epsilon(s',a')$. Recursively expanding the $\epsilon$ term, we arrive at $\epsilon(s,a) = 0 + \gamma \cdot 0 + \gamma^2 \cdot 0 + \dots = 0$.


Theorem 2. For a deterministic MDP and all reward functions, $\epsilon^{\pi}_{\text{MDP}} = 0$ if and only if the policy $\pi$ is batch-constrained. Furthermore, if $\mathcal{B}$ is coherent, then such a policy must exist if the start state $s_0 \in \mathcal{B}$.

Proof. The first part of the Theorem follows from Lemma 1, noting that for a deterministic policy $\pi$, if $(s,a) \in \mathcal{B}$ then we must have $p_{\mathcal{B}}(s'|s,a) = p_M(s'|s,a)$ for all $s' \in \mathcal{S}$.

We can construct the batch-constrained policy by selecting $a$ in the state $s \in \mathcal{B}$, such that $(s,a) \in \mathcal{B}$. Since the MDP is deterministic and the batch is coherent, when starting from $s_0$, we must be able to follow at least one trajectory until termination.


Theorem 3. Given the Robbins-Monro stochastic convergence conditions on the learning rate $\alpha$, and standard sampling requirements from the environment, BCQL converges to the optimal value function $Q^{*}$.

Proof. Follows from proof of convergence of Q-learning (see Section A.2), noting the batch-constraint is non-restrictive with a batch which contains all possible transitions.


Theorem 4. Given a deterministic MDP and coherent batch $\mathcal{B}$, along with the Robbins-Monro stochastic convergence conditions on the learning rate $\alpha$ and standard sampling requirements on the batch $\mathcal{B}$, BCQL converges to $Q^{\pi}_{\mathcal{B}}(s,a)$, where $\pi^{*}(s) = \operatorname{argmax}_{a \text{ s.t. } (s,a) \in \mathcal{B}} Q^{\pi}_{\mathcal{B}}(s,a)$ is the optimal batch-constrained policy.

Proof. The result follows from Theorem 1, which states that Q-learning learns the optimal value of the MDP $M_{\mathcal{B}}$ for state-action pairs $(s,a) \in \mathcal{B}$. However, for a deterministic MDP, $M_{\mathcal{B}}$ corresponds to the true MDP in all seen state-action pairs. Noting that batch-constrained policies operate only on state-action pairs where $M_{\mathcal{B}}$ corresponds to the true MDP, it follows that $\pi^{*}$ will be the optimal batch-constrained policy, from the optimality of Q-learning.


A.2 Sketch of the Proof of Convergence of Q-Learning

The proof of convergence of Q-learning relies largely on the following lemma (Singh et al., 2000):

Lemma 2. Consider a stochastic process $(\zeta_t, \Delta_t, F_t)$, $t \geq 0$, where $\zeta_t, \Delta_t, F_t : X \rightarrow \mathbb{R}$ satisfy the equation:

$\Delta_{t+1}(x_t) = (1 - \zeta_t(x_t)) \Delta_t(x_t) + \zeta_t(x_t) F_t(x_t), \qquad (18)$

where $x_t \in X$ and $t = 0, 1, 2, \dots$. Let $P_t$ be a sequence of increasing $\sigma$-fields such that $\zeta_0$ and $\Delta_0$ are $P_0$-measurable and $\zeta_t, \Delta_t$ and $F_{t-1}$ are $P_t$-measurable, $t = 1, 2, \dots$. Assume that the following hold:

  1. The set $X$ is finite.
  2. $\zeta_t(x_t) \in [0,1]$, $\sum_t \zeta_t(x_t) = \infty$, $\sum_t (\zeta_t(x_t))^2 < \infty$ with probability $1$, and $\forall x \neq x_t : \zeta(x) = 0$.
  3. $||\mathbb{E}[F_t | P_t]|| \leq \kappa ||\Delta_t|| + c_t$, where $\kappa \in [0,1)$ and $c_t$ converges to $0$ with probability $1$.
  4. $\text{Var}[F_t(x_t) | P_t] \leq K (1 + \kappa ||\Delta_t||)^2$, where $K$ is some constant.

where $||\cdot||$ denotes the maximum norm. Then $\Delta_t$ converges to $0$ with probability $1$.


Sketch of Proof of Convergence of Q-Learning. We set $\Delta_t = Q_t(s,a) - Q^{*}(s,a)$. Then convergence follows by satisfying the conditions of Lemma 2. Condition 1 is satisfied by the finite MDP, setting $X = \mathcal{S} \times \mathcal{A}$. Condition 2 is satisfied by the assumption of Robbins-Monro stochastic convergence conditions on the learning rate $\alpha_t$, setting $\zeta_t = \alpha_t$. Condition 4 is satisfied by the bounded reward function, where $F_t(s,a) = r(s,a,s') + \gamma \max_{a'} Q(s',a') - Q^{*}(s,a)$, and the sequence $P_t = \{Q_0, s_0, a_0, \alpha_0, r_1, s_1, \dots, s_t, a_t\}$. Finally, Condition 3 follows from the contraction of the Bellman operator $\mathcal{T}$, requiring infinite state-action visitation, infinite updates, and $\gamma < 1$.

Additional and more complete details can be found in numerous resources (Dayan & Watkins, 1992; Singh et al., 2000; Melo, 2001).

Appendix B Missing Graphs

B.1 Extrapolation Error in Deep Reinforcement Learning

Figure 4: (a) Final buffer performance. (b) Final buffer value estimates. (c) Concurrent performance. (d) Concurrent value estimates. (e) Imitation performance. (f) Imitation value estimates. We examine the performance of DDPG in three batch tasks. Each individual trial is plotted with a thin line, with the mean in bold (evaluated without exploration noise). Straight lines represent the average return of episodes contained in the batch (with exploration noise). An estimate of the true value of the off-policy agent, evaluated by Monte Carlo returns, is marked by a dotted line. In the final buffer experiment, the off-policy agent learns from a large, diverse dataset, but exhibits poor learning behavior and value estimation. In the concurrent setting the agent learns alongside a behavioral agent, with access to the same data, but suffers in performance. In the imitation setting, the agent receives data from an expert policy but is unable to learn, and exhibits highly divergent value estimates.

B.2 Complete Experimental Results

Figure 5 (panels: (a) final buffer performance; (b) final buffer value estimates; (c) concurrent performance; (d) concurrent value estimates; (e) imitation performance; (f) imitation value estimates; (g) imperfect demonstrations performance; (h) imperfect demonstrations value estimates): We evaluate BCQ and several baselines on the experiments from Section 3.1, as well as a new imperfect demonstration task. Performance is graphed on the left, and value estimates are graphed on the right. The shaded area represents half a standard deviation. The bold black line measures the average return of episodes contained in the batch. For the value estimates, each individual trial is plotted, with the mean in bold. An estimate of the true value of BCQ, evaluated by Monte Carlo returns, is marked by a dotted line. Only BCQ matches or outperforms the performance of the behavioral policy in all tasks, while exhibiting a highly stable value function in each task.

We present the complete set of results across each task and environment in Figure 5. These results show that BCQ successfully mitigates extrapolation error and learns from a variety of fixed batch settings. Although BCQ with our current hyper-parameters was never found to fail, we noted that with slight changes to the hyper-parameters, BCQ failed periodically on the concurrent learning task in the HalfCheetah-v1 environment, exhibiting instability in the value function after 750,000 or more iterations on some seeds. We hypothesize that this instability could occur if the generative model failed to output in-distribution actions, and could be corrected through additional training or improvements to the vanilla VAE. Interestingly, BCQ still performs well in these instances, due to the behavioral cloning-like elements in the algorithm.

Appendix C Extrapolation Error in Kernel-Based Reinforcement Learning

This problem of extrapolation persists in traditional batch reinforcement learning algorithms, such as kernel-based reinforcement learning (KBRL) (Ormoneit & Sen, 2002). For a given batch $\mathcal{B}$ of transitions $(s,a,r,s')$, non-negative density function $\phi: \mathbb{R}^+ \rightarrow \mathbb{R}^+$, hyper-parameter $\tau \in \mathbb{R}$, and norm $||\cdot||$, KBRL evaluates the value of a state-action pair $(s,a)$ as follows:

$$Q(s,a) = \sum_{(s_{\mathcal{B}}^{a},a,r,s_{\mathcal{B}}')\in\mathcal{B}} \kappa^{a}_{\tau}(s,s_{\mathcal{B}}^{a})\left[r+\gamma V(s_{\mathcal{B}}')\right], \tag{19}$$
$$\kappa_{\tau}^{a}(s,s_{\mathcal{B}}^{a}) = \frac{k_{\tau}(s,s_{\mathcal{B}}^{a})}{\sum_{\tilde{s}_{\mathcal{B}}^{a}} k_{\tau}(s,\tilde{s}_{\mathcal{B}}^{a})}, \qquad k_{\tau}(s,s_{\mathcal{B}}^{a}) = \phi\left(\frac{||s-s_{\mathcal{B}}^{a}||}{\tau}\right), \tag{20}$$

where $s_{\mathcal{B}}^{a} \in \mathcal{S}$ represents states corresponding to the action $a$ for some tuple $(s_{\mathcal{B}}, a) \in \mathcal{B}$, and $V(s_{\mathcal{B}}') = \max_{a \text{ s.t. } (s_{\mathcal{B}}', a) \in \mathcal{B}} Q(s_{\mathcal{B}}', a)$. At each iteration, KBRL updates the estimates of $Q(s_{\mathcal{B}}, a_{\mathcal{B}})$ for all $(s_{\mathcal{B}}, a_{\mathcal{B}}) \in \mathcal{B}$ following Equation (19), then updates $V(s_{\mathcal{B}}')$ by evaluating $Q(s_{\mathcal{B}}', a)$ for all $s_{\mathcal{B}} \in \mathcal{B}$ and $a \in \mathcal{A}$.
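The following is a minimal NumPy sketch of one such evaluation sweep over Equations (19)-(20); the Gaussian choice of $\phi$, the data layout, and all names are our own assumptions rather than the original KBRL implementation.

```python
import numpy as np

def kbrl_q_values(batch, query_states, tau=1.0, gamma=0.99):
    """One evaluation sweep of Equations (19)-(20). `batch` maps each action a to
    (s_B: observed states paired with a, r_B: their rewards, v_next: current
    estimates V(s'_B) for their next states); the caller re-computes
    v_next = max_a Q(s'_B, a) between sweeps."""
    phi = lambda x: np.exp(-x ** 2)  # an arbitrary non-negative density function
    Q = {}
    for a, (s_B, r_B, v_next) in batch.items():
        # Kernel weights kappa_tau^a(s, s_B^a), normalized over the states paired with a.
        dists = np.linalg.norm(query_states[:, None, :] - s_B[None, :, :], axis=-1)
        kappa = phi(dists / tau)
        kappa /= kappa.sum(axis=1, keepdims=True)
        Q[a] = kappa @ (r_B + gamma * v_next)  # Equation (19)
    return Q

# Toy usage: one observed transition per action, queried at two states.
batch = {0: (np.array([[1.0]]), np.array([0.0]), np.array([0.0])),
         1: (np.array([[0.0]]), np.array([1.0]), np.array([0.0]))}
print(kbrl_q_values(batch, query_states=np.array([[0.0], [1.0]])))
```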

Figure 6: Toy MDP with two states $s_0$ and $s_1$, and two actions $a_0$ and $a_1$. The agent receives a reward of $1$ for selecting $a_1$ at $s_0$ and $0$ otherwise.

Given access to the entire deterministic MDP, KBRL will provably converge to the optimal value; however, when limited to only a subset, we find the value estimation susceptible to extrapolation. In Figure 6, we provide a deterministic two-state, two-action MDP in which KBRL fails to learn the optimal policy when provided with state-action pairs from the optimal policy. Given the batch $\{(s_0, a_1, r=1, s_1), (s_1, a_0, r=0, s_0)\}$, corresponding to the optimal behavior, and noting that there is only one example of each action, Equation (19) provides the following:

$$Q(\cdot, a_1) = 1 + \gamma V(s_1) = 1 + \gamma Q(s_1, a_0), \qquad Q(\cdot, a_0) = \gamma V(s_0) = \gamma Q(s_0, a_1). \tag{21}$$

After sufficient iterations KBRL will converge correctly to $Q(s_0, a_1) = \frac{1}{1-\gamma^2}$, $Q(s_1, a_0) = \frac{\gamma}{1-\gamma^2}$. However, when evaluating actions, KBRL erroneously extrapolates the value of each action, $Q(\cdot, a_1) = \frac{1}{1-\gamma^2}$, $Q(\cdot, a_0) = \frac{\gamma}{1-\gamma^2}$, and its behavior, $\operatorname{argmax}_a Q(s,a)$, will result in the degenerate policy of continually selecting $a_1$. KBRL fails on this example because it estimates the values of unseen state-action pairs. In methods where the extrapolated estimates are included in the learning update, such as fitted Q-iteration or DQN (Ernst et al., 2005; Mnih et al., 2015), this can cause an unbounded sequence of value estimates, as demonstrated by our results in Section 3.1.
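The degenerate behavior can be checked directly by iterating Equation (21) to its fixed point; a short sketch of our own, using $\gamma = 0.99$:

```python
gamma = 0.99
Q = {('s0', 'a1'): 0.0, ('s1', 'a0'): 0.0}  # only the state-action pairs in the batch
for _ in range(10000):  # iterate Equation (21) until convergence
    Q[('s0', 'a1')] = 1.0 + gamma * Q[('s1', 'a0')]
    Q[('s1', 'a0')] = gamma * Q[('s0', 'a1')]

print(Q[('s0', 'a1')], 1 / (1 - gamma ** 2))      # ~50.25 in both cases
print(Q[('s1', 'a0')], gamma / (1 - gamma ** 2))  # ~49.75 in both cases
# KBRL extrapolates Q(., a1) > Q(., a0) at *every* state, so argmax_a Q(s, a)
# selects a1 everywhere -- including at s1, where a1 was never observed.
```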

Appendix D Additional Experiments

D.1 Ablation Study of Perturbation Model

BCQ includes a perturbation model $\xi_\theta(s, a, \Phi)$ which outputs a small residual update to the actions sampled by the generative model, in the range $[-\Phi, \Phi]$. This enables the policy to select actions which may not have been sampled by the generative model. If $\Phi = a_{\text{max}} - a_{\text{min}}$, then all actions can plausibly be selected by the model, similar to standard deep reinforcement learning algorithms, such as DQN and DDPG (Mnih et al., 2015; Lillicrap et al., 2015). In Figure 7 we examine the performance and value estimates of BCQ when varying the hyper-parameter $\Phi$, which corresponds to how far the model is able to move away from the actions sampled by the generative model.

Figure 7 (panels: (a) imitation performance; (b) imitation value estimates): We perform an ablation study on the perturbation model of BCQ, on the imitation task from Section 3.1. Performance is graphed on the left, and value estimates are graphed on the right. The shaded area represents half a standard deviation. The bold black line measures the average return of episodes contained in the batch. For the value estimates, each individual trial is plotted, with the mean in bold.

We observe a clear drop in performance as $\Phi$ increases, along with increased instability in the value function. Given that the data is based on expert performance, this is consistent with our understanding of extrapolation error: with larger $\Phi$ the agent learns to take actions that are further from the data in the batch after erroneously overestimating the value of suboptimal actions. This suggests the ideal value of $\Phi$ should be small enough to stay close to the generated actions, but large enough that learning can be performed when exploratory actions are included in the dataset.

D.2 Uncertainty Estimation for Batch-Constrained Reinforcement Learning

Section 4.2 proposes using a generative model to constrain the output of the policy $\pi$, eliminating actions which are unlikely under the batch. However, a more natural approach would be through approximate uncertainty-based methods (Osband et al., 2016; Gal et al., 2016; Azizzadenesheli et al., 2018). These methods are well known to be effective for exploration; here we examine their properties for exploitation, where we would like to avoid uncertain actions.

To measure the uncertainty of the value network, we use ensemble-based methods, with ensembles of size $4$ and $10$ to mimic the models used by Buckman et al. (2018) and Osband et al. (2016), respectively. Each network is trained with separate mini-batches, following the standard deep Q-learning update with a target network. These networks use the default architecture and hyper-parameter choices as defined in Section G. The policy $\pi_\phi$ is trained to minimize the standard deviation $\sigma$ across the ensemble:

$$\phi \leftarrow \operatorname*{argmin}_\phi \sum_{(s,a)\in\mathcal{B}} \sigma\left(\{Q_{\theta_i}(s,a)\}_{i=1}^{N}\right). \tag{22}$$

If the ensembles were a perfect estimate of the uncertainty, the policy would learn to select the most certain action for a given state, minimizing the extrapolation error and effectively imitating the data in the batch.
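A minimal PyTorch sketch of this objective (Equation (22)); the ensemble and policy networks below are our own stand-ins with Hopper-v1 dimensions, not the exact models from Section G.

```python
import torch
import torch.nn as nn

state_dim, action_dim, N = 11, 3, 4  # Hopper-v1 sizes, ensemble of 4 (assumed)

def mlp(in_dim, out_dim):
    return nn.Sequential(nn.Linear(in_dim, 400), nn.ReLU(),
                         nn.Linear(400, 300), nn.ReLU(), nn.Linear(300, out_dim))

ensemble = [mlp(state_dim + action_dim, 1) for _ in range(N)]
policy = nn.Sequential(mlp(state_dim, action_dim), nn.Tanh())
policy_opt = torch.optim.Adam(policy.parameters(), lr=1e-3)

def policy_update(state):
    """One step of Equation (22): minimize the std of the ensemble's Q-estimates
    at the policy's action, pushing the policy toward 'certain' actions."""
    action = policy(state)
    qs = torch.cat([Q(torch.cat([state, action], dim=1)) for Q in ensemble], dim=1)
    loss = qs.std(dim=1).mean()
    policy_opt.zero_grad()
    loss.backward()
    policy_opt.step()
    return loss.item()

# Usage on a random mini-batch of states (a placeholder for batch data).
print(policy_update(torch.randn(100, state_dim)))
```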

To test these uncertainty-based methods, we examine their performance on the imitation task in the Hopper-v1 environment (Todorov et al., 2012; Brockman et al., 2016), in which a dataset of 1 million expert transitions is provided to the agents. Additional experimental details can be found in Section F. The performance, alongside the value estimates of the agents, is displayed in Figure 8.

Figure 8 (panels: (a) imitation performance; (b) imitation value estimates): A comparison with uncertainty-based methods on the Hopper-v1 environment from OpenAI gym, on the imitation task. Although ensemble-based methods are able to generate stable value functions (right) under these conditions, they fail to constrain the action space to the demonstrated expert actions and suffer in performance compared to our approach, BCQ (left).

We find that neither ensemble method is sufficient to constrain the action space to only the expert actions. However, the value function is stabilized, suggesting that ensembles are an effective strategy for eliminating outliers or large deviations in the value caused by erroneous extrapolation. Unsurprisingly, the larger ensemble provides a more accurate estimate of the uncertainty. While scaling the ensemble to larger sizes could possibly enable an effective batch constraint, increasing the size of the ensemble induces a large computational cost. Finally, in this task, where only expert data is provided, the policy can attempt to imitate the data without consideration of the value; however, in other tasks, a weighting between value and uncertainty would need to be carefully tuned. On the other hand, BCQ offers a computationally cheap approach without requiring difficult hyper-parameter tuning.

D.3 Random Behavioral Policy Study

The experiments in Section 3.1 and 5 use a learned, or partially learned, behavioral policy for data collection. This is a necessary requirement for learning meaningful behavior, as a random policy generally fails to provide sufficient coverage over the state space. However, in simple toy problems, such as the pendulum swing-up task and the reaching task with a two-joint arm from OpenAI gym (Brockman et al., 2016), a random policy can sufficiently explore the environment, enabling us to examine the properties of algorithms with entirely non-expert data.

In Figure 9, we examine the performance of our algorithm, BCQ, as well as DDPG (Lillicrap et al., 2015), on these two toy problems, when learning off-policy from a small batch of 5000 time steps, collected entirely by a random policy. We find that both BCQ and DDPG are able to learn successfully in these off-policy tasks. These results suggest that BCQ is less restrictive than imitation learning algorithms, which require expert data to learn. We also find that unlike previous environments, given the small scale of the state and action space, the random policy is able to provide sufficient coverage for DDPG to learn successfully.

Figure 9 (panels: (a) random behavioral performance; (b) random behavioral value estimates): We evaluate BCQ and DDPG on a batch collected by a random behavioral policy. The shaded area represents half a standard deviation. Value estimates include a plot of each trial, with the mean in bold. An estimate of the true value of BCQ and DDPG, evaluated by Monte Carlo returns, is marked by a dotted line. While both BCQ and DDPG perform well, BCQ exhibits more stability in the value function for the Pendulum task, and outperforms DDPG in the Reacher task. These tasks demonstrate examples where any imitation learning would fail, as the data collection procedure is entirely random.

Appendix E Missing Background

E.1 Variational Auto-Encoder

A variational auto-encoder (VAE) (Kingma & Welling, 2013) is a generative model which aims to maximize the marginal log-likelihood $\log p(X) = \sum_{i=1}^{N} \log p(x_i)$, where $X = \{x_1, ..., x_N\}$ is the dataset. While computing the marginal likelihood is intractable in nature, we can instead optimize the variational lower bound:

$$\log p(X) \geq \mathbb{E}_{q(z|X)}[\log p(X|z)] - D_{\text{KL}}(q(z|X)\,||\,p(z)), \tag{23}$$

where $p(z)$ is a chosen prior, generally the multivariate normal distribution $\mathcal{N}(0, I)$. We define the posterior $q(z|X) = \mathcal{N}(z|\mu(X), \sigma^2(X)I)$ as the encoder and $p(X|z)$ as the decoder. Simply put, this means a VAE is an auto-encoder in which a given sample $x$ is passed through the encoder to produce a random latent vector $z$, which is given to the decoder to reconstruct the original sample $x$. The VAE is trained on a reconstruction loss, and a KL-divergence term according to the distribution of the latent vectors. To perform gradient descent on the variational lower bound we can use the re-parametrization trick (Kingma & Welling, 2013; Rezende et al., 2014):

$$\mathbb{E}_{z\sim\mathcal{N}(\mu,\sigma)}[f(z)] = \mathbb{E}_{\epsilon\sim\mathcal{N}(0,I)}[f(\mu + \sigma\epsilon)]. \tag{24}$$

This formulation allows for back-propagation through stochastic nodes, by noting that $\mu$ and $\sigma$ can be represented by deterministic functions. During inference, random values of $z$ are sampled from the multivariate normal and passed through the decoder to produce samples $x$.
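As a concrete illustration, a minimal PyTorch VAE built around this re-parametrization; this is a generic sketch with arbitrary layer sizes, not the state-conditioned VAE used by BCQ (described in Section G).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VAE(nn.Module):
    def __init__(self, x_dim, z_dim=2, hidden=64):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(x_dim, hidden), nn.ReLU())
        self.mu = nn.Linear(hidden, z_dim)
        self.log_sigma = nn.Linear(hidden, z_dim)
        self.dec = nn.Sequential(nn.Linear(z_dim, hidden), nn.ReLU(), nn.Linear(hidden, x_dim))

    def forward(self, x):
        h = self.enc(x)
        mu, sigma = self.mu(h), self.log_sigma(h).exp()
        z = mu + sigma * torch.randn_like(sigma)  # re-parametrization trick, Eq. (24)
        return self.dec(z), mu, sigma

def vae_loss(recon, x, mu, sigma):
    # Reconstruction term plus analytic KL(N(mu, sigma) || N(0, I)), cf. Eq. (31).
    recon_loss = F.mse_loss(recon, x)
    kl = -0.5 * (1 + torch.log(sigma ** 2) - mu ** 2 - sigma ** 2).sum(dim=1).mean()
    return recon_loss + kl

# Usage: one gradient computation on random data standing in for a dataset X.
vae, x = VAE(x_dim=8), torch.randn(32, 8)
recon, mu, sigma = vae(x)
vae_loss(recon, x, mu, sigma).backward()
```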

Appendix F Experimental Details

Each environment is run for 1 million time steps, unless stated otherwise, with evaluations every 5000 time steps, where an evaluation measures the average reward over 10 episodes with no exploration noise. Our results are reported over 5 random seeds of the behavioral policy, OpenAI Gym simulator and network initialization. Value estimates are averaged over mini-batches of $100$ and sampled every $2500$ iterations. The true value is estimated by sampling $100$ state-action pairs from the buffer replay and computing the discounted return by running the episode until completion while following the current policy.

Each agent is trained after each episode by applying one training iteration per time step in the episode. The agent is trained with transition tuples $(s,a,r,s')$ sampled from an experience replay that is defined by each experiment. We define four possible experiments. Unless stated otherwise, default implementation settings, as defined in Section G, are used.

Batch 1 (Final buffer). We train a DDPG (Lillicrap et al., 2015) agent for 1 million time steps, adding large amounts of Gaussian noise ($\mathcal{N}(0, 0.5)$) to induce exploration, and store all experienced transitions in a buffer replay. This training procedure creates a buffer replay with a diverse set of states and actions. A second, randomly initialized agent is trained using the 1 million stored transitions.

Batch 2 (Concurrent learning). We simultaneously train two agents for 1 million time steps. The first, a DDPG agent, performs data collection, and each transition is stored in a buffer replay which both agents learn from. This means the behavioral agent learns under the standard training regime of most off-policy deep reinforcement learning algorithms, while the second agent learns off-policy, as the data is collected without any direct relationship to its current policy. Batch 2 differs from Batch 1 in that the agents are trained with the same version of the buffer, while in Batch 1 the agent learns from the final buffer replay after the behavioral agent has finished training.

Batch 3 (Imitation). A DDPG agent is trained for 1 million time steps. The trained agent then acts as an expert policy, and is used to collect a dataset of 1 million transitions. This dataset is used to train a second, randomly initialized agent. In particular, we train DDPG across 15 seeds, and select the 5 top performing seeds as the expert policies.

Batch 4 (Imperfect demonstrations). The expert policies from Batch 3 are used to collect a dataset of 100k transitions, while selecting actions randomly with probability $0.3$ and adding Gaussian noise $\mathcal{N}(0, 0.3)$ to the remaining actions. This dataset is used to train a second, randomly initialized agent.

Appendix G Implementation Details

Across all methods and experiments, for fair comparison, each network generally uses the same hyper-parameters and architecture, which are defined in Table 1 and Figure 10 respectively. The value functions follow the standard practice (Mnih et al., 2015) in which the Bellman update differs for terminal transitions. When the episode ends by reaching some terminal state, the value is set to $0$ in the learning target $y$:

$$y = \begin{cases} r & \text{if terminal } s' \\ r + \gamma Q_{\theta'}(s', a') & \text{else} \end{cases} \tag{25}$$

The termination signal from time-limited environments is ignored; thus, we only consider a state $s_t$ terminal if $t <$ max horizon.
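A sketch of how this target can be computed from a mini-batch; the tensor names are our own, and `not_done` is zero only for true terminal states, with time-limit terminations already filtered out as described above.

```python
import torch

def bellman_target(reward, not_done, next_q, gamma=0.99):
    """Equation (25): y = r at true terminal transitions, r + gamma * Q'(s', a') otherwise.
    `not_done` is 0 only when s' is terminal and t < max horizon (time-outs are ignored)."""
    return reward + not_done * gamma * next_q

# Usage on a toy mini-batch of three transitions, the second of which is terminal.
r = torch.tensor([1.0, 0.5, -0.2])
not_done = torch.tensor([1.0, 0.0, 1.0])
next_q = torch.tensor([10.0, 3.0, 7.0])
print(bellman_target(r, not_done, next_q))  # tensor([10.9000, 0.5000, 6.7300])
```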

Table 1: Default hyper-parameters.

Hyper-parameter               Value
Optimizer                     Adam (Kingma & Ba, 2014)
Learning Rate                 $10^{-3}$
Batch Size                    $100$
Normalized Observations       False
Gradient Clipping             False
Discount Factor               $0.99$
Target Update Rate ($\tau$)   $0.005$
Exploration Policy            $\mathcal{N}(0, 0.1)$

Figure 10: Default network architecture, based on DDPG (Lillicrap et al., 2015): (input dimension, 400) → ReLU → (400, 300) → ReLU → (300, output dimension). All actor networks are followed by a tanh activation multiplied by the max action size.

BCQ. BCQ uses four main networks: a perturbation model $\xi_\phi(s,a)$, a state-conditioned VAE $G_\omega(s)$ and a pair of value networks $Q_{\theta_1}(s,a), Q_{\theta_2}(s,a)$, along with three corresponding target networks $\xi_{\phi'}(s,a), Q_{\theta_1'}(s,a), Q_{\theta_2'}(s,a)$. Each network, other than the VAE, follows the default architecture (Figure 10) and the default hyper-parameters (Table 1). For $\xi_\phi(s,a,\Phi)$, the constraint $\Phi$ is implemented through a tanh activation multiplied by $I \cdot \Phi$ following the final layer.

As suggested by Fujimoto et al. (2018), the perturbation model is trained only with respect to $Q_{\theta_1}$, following the deterministic policy gradient algorithm (Silver et al., 2014), by performing a residual update on a single action sampled from the generative model:

$$\phi \leftarrow \operatorname*{argmax}_\phi \sum_{(s,a)\in\mathcal{B}} Q_{\theta_1}\big(s, a + \xi_\phi(s, a, \Phi)\big), \qquad a \sim G_\omega(s). \tag{26}$$
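A minimal sketch of this perturbation-model update (Equation (26)), with stand-in networks and HalfCheetah-v1 dimensions; the value of $\Phi$ and the clamp to the valid action range are assumptions made for illustration, not the exact implementation.

```python
import torch
import torch.nn as nn

# Stand-in dimensions (HalfCheetah-v1) and an assumed perturbation range Phi.
state_dim, action_dim, max_action, Phi = 17, 6, 1.0, 0.05

class Perturbation(nn.Module):
    """xi_phi(s, a, Phi): a residual bounded to [-Phi, Phi] by a tanh output scaled by Phi."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(state_dim + action_dim, 400), nn.ReLU(),
                                 nn.Linear(400, 300), nn.ReLU(), nn.Linear(300, action_dim))

    def forward(self, state, action):
        return Phi * max_action * torch.tanh(self.net(torch.cat([state, action], dim=1)))

xi = Perturbation()
q1 = nn.Sequential(nn.Linear(state_dim + action_dim, 400), nn.ReLU(),
                   nn.Linear(400, 300), nn.ReLU(), nn.Linear(300, 1))
xi_optimizer = torch.optim.Adam(xi.parameters(), lr=1e-3)

def perturbation_update(state, sampled_action):
    """One step of Equation (26): maximize Q_theta1 at the perturbed generated action."""
    perturbed = (sampled_action + xi(state, sampled_action)).clamp(-max_action, max_action)
    loss = -q1(torch.cat([state, perturbed], dim=1)).mean()
    xi_optimizer.zero_grad()
    loss.backward()
    xi_optimizer.step()

# `sampled_action` would come from the VAE G_omega(s); a random stand-in is used here.
perturbation_update(torch.randn(100, state_dim), torch.rand(100, action_dim) * 2 - 1)
```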

To penalize uncertainty over future states, we train a pair of value estimates $\{Q_{\theta_1}, Q_{\theta_2}\}$ and take a weighted minimum between the two values as the learning target $y$ for both networks. First, $n$ actions are sampled with respect to the generative model, then adjusted by the target perturbation model, before being passed to each target Q-network:

$$\begin{split}
y &= r + \gamma \max_{a_i}\Big[\lambda \min_{j=1,2} Q_{\theta'_j}(s', \tilde{a}_i) + (1-\lambda)\max_{j=1,2} Q_{\theta'_j}(s', \tilde{a}_i)\Big],\\
&\quad \{\tilde{a}_i = a_i + \xi_{\phi'}(s', a_i, \Phi)\}_{i=0}^{n}, \qquad \{a_i \sim G_\omega(s')\}_{i=0}^{n},\\
\mathcal{L}_{\text{value},i} &= \sum_{(s,a,r,s')\in\mathcal{B}} \big(y - Q_{\theta_i}(s,a)\big)^2.
\end{split} \tag{27}$$

Both networks are trained with the same target $y$, where $\lambda = 0.75$.
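A sketch of the corresponding target computation (Equation (27)), written against hypothetical callables `vae_decode`, `xi_t`, `q1_t` and `q2_t` that stand in for the trained components; the dimensions in the usage example are assumed.

```python
import torch

def bcq_target(reward, not_done, next_state, vae_decode, xi_t, q1_t, q2_t,
               n=10, lam=0.75, gamma=0.99):
    """Equation (27): sample n candidate actions per next state from the VAE,
    perturb each with the target perturbation model, and take a weighted
    min/max over the two target Q-networks before maximizing over candidates."""
    batch_size = next_state.shape[0]
    s = next_state.repeat_interleave(n, dim=0)   # n copies of each next state
    a = vae_decode(s)                            # a_i ~ G_omega(s')
    a = a + xi_t(s, a)                           # a_i + xi_phi'(s', a_i, Phi)
    q1, q2 = q1_t(s, a), q2_t(s, a)
    q = lam * torch.min(q1, q2) + (1.0 - lam) * torch.max(q1, q2)
    q = q.reshape(batch_size, n).max(dim=1, keepdim=True).values
    return reward + not_done * gamma * q         # y, with r used alone at terminals

# Toy usage with random stand-ins for the learned components (dimensions assumed).
B, state_dim, action_dim = 4, 17, 6
q_stub = lambda s, a: a.sum(dim=1, keepdim=True)
y = bcq_target(torch.zeros(B, 1), torch.ones(B, 1), torch.randn(B, state_dim),
               vae_decode=lambda s: torch.rand(s.shape[0], action_dim) * 2 - 1,
               xi_t=lambda s, a: 0.05 * torch.tanh(torch.randn_like(a)),
               q1_t=q_stub, q2_t=q_stub)
print(y.shape)  # torch.Size([4, 1])
```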

The VAE $G_\omega$ is defined by two networks, an encoder $E_{\omega_1}(s,a)$ and decoder $D_{\omega_2}(s,z)$, where $\omega = \{\omega_1, \omega_2\}$. The encoder takes a state-action pair and outputs the mean $\mu$ and standard deviation $\sigma$ of a Gaussian distribution $\mathcal{N}(\mu, \sigma)$. The state $s$, along with a latent vector $z$ sampled from the Gaussian, is passed to the decoder $D_{\omega_2}(s,z)$, which outputs an action. Each network follows the default architecture (Figure 10), with two hidden layers of size $750$, rather than $400$ and $300$. The VAE is trained with respect to the mean squared error of the reconstruction along with a KL regularization term:

$$\mathcal{L}_{\text{reconstruction}} = \sum_{(s,a)\in\mathcal{B}}\big(D_{\omega_2}(s,z) - a\big)^2, \quad z = \mu + \sigma \cdot \epsilon, \quad \epsilon \sim \mathcal{N}(0,1), \tag{28}$$
$$\mathcal{L}_{\text{KL}} = D_{\text{KL}}(\mathcal{N}(\mu,\sigma)\,||\,\mathcal{N}(0,1)), \tag{29}$$
$$\mathcal{L}_{\text{VAE}} = \mathcal{L}_{\text{reconstruction}} + \lambda \mathcal{L}_{\text{KL}}. \tag{30}$$

Noting the Gaussian form of both distributions, the KL divergence term can be simplified (Kingma & Welling, 2013):

$$D_{\text{KL}}(\mathcal{N}(\mu,\sigma)\,||\,\mathcal{N}(0,1)) = -\frac{1}{2}\sum_{j=1}^{J}\big(1 + \log(\sigma_j^2) - \mu_j^2 - \sigma_j^2\big), \tag{31}$$

where $J$ denotes the dimensionality of $z$. For each experiment, $J$ is set to twice the dimensionality of the action space. The KL divergence term in $\mathcal{L}_{\text{VAE}}$ is normalized across experiments by setting $\lambda = \frac{1}{2J}$. During inference with the VAE, the latent vector $z$ is clipped to the range $[-0.5, 0.5]$ to limit generalization beyond previously seen actions. For the small-scale experiments in Supplementary Material D.3, $L_2$ regularization with weight $10^{-3}$ was used for the VAE to compensate for the small number of samples; the other networks remain unchanged. In the value network update (Equation 27), the VAE is sampled multiple times and passed to both the perturbation model and each Q-network. For each state in the mini-batch the VAE is sampled $n = 10$ times. This can be implemented efficiently by passing a latent vector with batch size $10 \cdot$ (mini-batch size), effectively $1000$, to the VAE and treating the output as a new mini-batch for the perturbation model and each Q-network. When running the agent in the environment, we sample from the VAE $10$ times, perturb each action with $\xi_\phi$, and select the highest-valued action with respect to $Q_{\theta_1}$.
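At evaluation time the same components select the action; a condensed sketch, with placeholder callables for the trained networks and assumed dimensions in the usage line:

```python
import torch

def bcq_select_action(state, vae_decode, xi, q1, n=10):
    """Sample n candidate actions from the VAE for the current state, perturb each
    with xi_phi, and return the candidate with the highest value under Q_theta1.
    (The latent z is clipped to [-0.5, 0.5] inside `vae_decode`, as described above.)"""
    with torch.no_grad():
        s = state.reshape(1, -1).repeat(n, 1)  # treat the n candidates as a mini-batch
        a = vae_decode(s)
        a = a + xi(s, a)
        return a[q1(s, a).argmax()]

# Toy usage with random stand-ins for the trained components (Hopper-v1 sizes assumed).
action = bcq_select_action(torch.randn(11),
                           vae_decode=lambda s: torch.rand(s.shape[0], 3) * 2 - 1,
                           xi=lambda s, a: 0.05 * torch.tanh(torch.randn_like(a)),
                           q1=lambda s, a: a.sum(dim=1))
```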

DDPG. Our DDPG implementation deviates from some of the default architecture and hyper-parameters to mimic the original implementation more closely (Lillicrap et al., 2015). In particular, the action is only passed to the critic at the second layer (Figure 11), the critic uses $L_2$ regularization with weight $10^{-2}$, and the actor uses a reduced learning rate of $10^{-4}$.

Figure 11: DDPG critic network architecture: (state dimension, 400) → ReLU → (400 + action dimension, 300) → ReLU → (300, 1).

As done by Fujimoto et al. (2018), our DDPG agent randomly selects actions for the first 10k time steps for HalfCheetah-v1, and 1k time steps for Hopper-v1 and Walker2d-v1. This was found to improve performance and reduce the likelihood of local minima in the policy during early iterations.

DQN. Given the high-dimensional nature of the action space of the experimental environments, our DQN implementation selects actions over an independently discretized action space. Each action dimension is discretized separately into $10$ possible actions, giving $10J$ possible sub-actions, where $J$ is the dimensionality of the action space. A given state-action pair $(s,a)$ then corresponds to a set of state-sub-action pairs $(s, a_{ij})$, with bins $i \in \{1, ..., 10\}$ and dimensions $j \in \{1, ..., J\}$. In each DQN update, all state-sub-action pairs $(s, a_{ij})$ are updated with respect to the average value of the target state-sub-action pairs $(s', a'_{ij})$. The learning update of the discretized DQN is as follows:

$$y = r + \frac{\gamma}{n}\sum_{j=1}^{J}\max_i Q_{\theta'}(s', a'_{ij}), \tag{32}$$
$$\theta \leftarrow \operatorname*{argmin}_\theta \sum_{(s,a,r,s')\in\mathcal{B}} \sum_{j=1}^{J}\big(y - Q_\theta(s, a_{ij})\big)^2, \tag{33}$$

where $a_{ij}$ is chosen by selecting the corresponding bin $i$ in the discretized action space for each dimension $j$. For clarity, we provide the exact DQN network architecture in Figure 12.

Figure 12: DQN network architecture: (state dimension, 400) → ReLU → (400, 300) → ReLU → (300, 10 · action dimension).
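A sketch of this discretized update (Equations (32)-(33)), assuming a Q-network whose final layer outputs a $10 \times J$ grid of sub-action values as in Figure 12; the dimensions and the interpretation $n = J$ (averaging over action dimensions) are our own assumptions.

```python
import torch
import torch.nn as nn

bins, J, state_dim = 10, 6, 17  # 10 bins per action dimension (HalfCheetah-v1 sizes assumed)

q_net = nn.Sequential(nn.Linear(state_dim, 400), nn.ReLU(),
                      nn.Linear(400, 300), nn.ReLU(), nn.Linear(300, bins * J))
q_target = nn.Sequential(nn.Linear(state_dim, 400), nn.ReLU(),
                         nn.Linear(400, 300), nn.ReLU(), nn.Linear(300, bins * J))

def discretized_dqn_loss(state, action_bins, reward, next_state, gamma=0.99):
    """Equations (32)-(33): the target averages the per-dimension maxima of the
    target network (here n = J); each selected sub-action (s, a_ij) regresses to it."""
    with torch.no_grad():
        next_q = q_target(next_state).reshape(-1, bins, J)                # (B, 10, J)
        y = reward + (gamma / J) * next_q.max(dim=1).values.sum(dim=1)    # (B,)
    q = q_net(state).reshape(-1, bins, J)                                 # (B, 10, J)
    q_sel = q.gather(1, action_bins.unsqueeze(1)).squeeze(1)              # (B, J): Q(s, a_ij)
    return ((y.unsqueeze(1) - q_sel) ** 2).sum(dim=1).mean()

# Usage: action_bins holds the chosen bin index i for each of the J dimensions.
loss = discretized_dqn_loss(torch.randn(32, state_dim),
                            torch.randint(0, bins, (32, J)),
                            torch.randn(32), torch.randn(32, state_dim))
loss.backward()
```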

Behavioral Cloning. We use two behavioral cloning methods, VAE-BC and BC. VAE-BC is implemented and trained exactly as $G_\omega(s)$ defined for BCQ. BC uses a feed-forward network with the default architecture and hyper-parameters, trained with a mean-squared error reconstruction loss.