
SARC: Soft Actor Retrospective Critic

Sukriti Verma, Carnegie Mellon University, sukritiv@andrew.cmu.edu

Ayush Chopra, MIT Media Lab, ayushc@mit.edu

Jayakumar Subramanian, Adobe, jasubram@adobe.com

Mausoom Sarkar, Adobe, msarkar@adobe.com

Nikaash Puri, Adobe, nikpuri@adobe.com

Piyush Gupta, Adobe, piygupta@adobe.com

Balaji Krishnamurthy, Adobe, kbalaji@adobe.com

Abstract

SAC is an actor-critic algorithm and therefore has a two-time-scale nature: the critic estimate has not converged for the actor at any given time, but since the critic learns faster than the actor, eventual consistency between the two is ensured. Various strategies have been introduced in the literature to learn better gradient estimates and thereby achieve better convergence. Since gradient estimates depend upon the critic, we posit that improving the critic can provide a better gradient estimate for the actor at each time step. Utilizing this, we propose Soft Actor Retrospective Critic (SARC), where we augment the SAC critic loss with another loss term, the retrospective loss, leading to faster critic convergence and, consequently, better policy gradient estimates for the actor. An existing implementation of SAC can be easily adapted to SARC with minimal modifications. Through extensive experimentation and analysis, we show that SARC provides consistent improvement over SAC on benchmark environments. We plan to open-source the code and all experiment data at https://github.com/sukritiverma1996/SARC.

1 Introduction

The idea of applying and extending successful supervised learning techniques from the field of deep learning to the field of reinforcement learning has long existed and has helped achieve remarkable success in deep RL over the last decade. One of the early breakthroughs, Deep Q-Networks (DQN) [12, 13], leveraged advances from the field of deep learning to train an end-to-end deep reinforcement learning agent, which achieved better than human-level performance in several challenging Atari games. Before DQNs, reinforcement learning could only be applied to tasks having low-dimensional state spaces or tasks where a small number of features could be extracted. With DQNs, it became possible to learn policies directly from high-dimensional input. This subsequently gave rise to a plethora of deep RL algorithms that combined deep learning with reinforcement learning. Broadly, these algorithms fall into three classes: critic-only (DQN, DDQN [23], etc.), actor-only (REINFORCE [26]) and actor-critic (SAC [6], PPO [16], etc.) algorithms. In this work, we seek to improve the Soft Actor-Critic (SAC) algorithm [6], one of the current state-of-the-art actor-critic algorithms in deep RL, by taking inspiration from a recent advance in the field of supervised deep learning: retrospective loss.
Actor-critic methods such as PPO, A3C [14], DPG [17], TD3 [5] and SAC have yielded great success in recent times [24, 2]. With actor-critic algorithms, the central idea is to split the learning into two key modules: i) Critic, which learns the policy-value function given the actor's policy. ii) Actor, which uses the critic's policy-value function to estimate a policy gradient and improve its policy.
One iteration of policy improvement in actor-critic methods [10] involves two interwoven steps: one step of critic learning followed by one step of actor learning. These steps are interwoven for the following reason. Given a fixed critic, policy improvement and actor learning are determined by policy gradient estimates computed using the policy-value function provided by the critic. Given a fixed actor, the policy remains unchanged for the critic and hence there exists a unique value function for the critic to learn. This allows the application of policy evaluation algorithms to learn this value function. A good value function, in turn, enables accurate policy gradient estimates for the actor [18, 27].
Several advances in actor-critic methods, such as TRPO [15], PPO, A3C and DDPG [11], have come from improving actor learning and discovering more stable ways of updating the policy. A few actor-critic algorithms, such as TD3 and SAC, have advanced the state of the art by improving critic learning. Both of these algorithms learn two value functions rather than one to counter the overestimation bias [21] in value learning. TD3 updates the critic more frequently than the actor to minimize the error in the value function [5]. SAC uses the entropy of the policy as a regulariser [6]. In this work, we propose to further improve critic learning in SAC by applying a regulariser that accelerates critic convergence.
This regulariser is inspired by a recent technique, called retrospective loss [8], that improves performance in the supervised learning setting. When the retrospective loss is minimized along with the task-specific loss, the parameters at the current training step are guided towards the optimal parameters while being pulled away from the parameters at a previous training step. In the supervised setting, retrospective loss results in better test accuracy.
Using this loss as a retrospective regulariser applied to the critic accelerates the convergence of the critic. Due to the two-time-scale nature of actor-critic algorithms [3, 10], causing the critic to learn faster leads to better policy gradient estimates for the actor, as gradient estimates depend on the value function learned by the critic. Better policy gradient estimates for the actor in turn lead to better future samples.
Bringing it all together, in this work we propose a novel actor-critic method, Soft Actor Retrospective Critic (SARC), which improves the existing Soft Actor-Critic (SAC) method by applying the aforementioned retrospective regulariser on the critic. We perform extensive empirical demonstration and analysis to show that SARC leads to better actors in terms of sample complexity as well as final return on several standard benchmark RL tasks provided in the DeepMind Control Suite and PyBullet Environments. We also show that SARC does not degrade performance on tasks where SAC already achieves optimal performance. To help reproducibility, and to convey that it takes minimal editing and essentially no added computation cost to convert an existing implementation of SAC to SARC, we open-source the code and all experiment data at https://github.com/sukritiverma1996/SARC.
2 Related Work

Out of the three broad classes of RL algorithms, critic-only methods can efficiently learn an optimal policy [9] (implicitly derived from the optimal action-value function) in the tabular setting. But in continuous or stochastic settings, function approximation can cause critic-only methods to diverge [25]. Actor-only methods can be applied to these settings, but they suffer from high variance when computing policy gradient estimates. Actor-critic algorithms combine the advantages of critic-only and actor-only methods. The key difference between actor-critic and actor-only algorithms is that actor-critic algorithms estimate an explicit critic or policy-value function instead of the Monte Carlo return estimated by actor-only algorithms. In the sequel, we restrict attention to actor-critic algorithms. A more detailed comparison can be found in Sutton and Barto [18].
Methods that model an explicit policy, such as actor-only and actor-critic algorithms, consist of two steps: i) policy evaluation: calculating the policy-value function for the current policy, and ii) policy improvement: using the value function to learn a better policy.
It is customary in RL to run these steps simultaneously, utilizing the two-time-scale stochastic approximation algorithm [3, 19, 10], rather than running each step to convergence individually. This is the approach used in actor-critic algorithms, where the critic learns the policy-value function given the actor's policy and the actor improves its policy guided by policy gradient estimates computed using the policy-value function provided by the critic.
RL algorithms can be on-policy or off-policy. Off-policy actor-critic algorithms like Deep Deterministic Policy Gradient (DDPG) [11] and Twin Delayed DDPG (TD3) [5] often have a better sample complexity than on-policy algorithms. DDPG is a deep variant of the deterministic policy gradient algorithm [17]. It combines the actor-critic approach with insights from DQN. However, DDPG suffers from an overestimation bias [21] in learning the Q-values. This hinders policy improvement. TD3 addresses this by learning two Q-functions and using the smaller of the two to compute the policy gradient. This favors underestimation. TD3 also proposes delaying policy updates until the value estimates have converged.
On-policy algorithms provide better stability but are not as sample efficient, as they cannot reuse past experience through a replay buffer the way off-policy algorithms do. Trust Region Policy Optimization (TRPO) [15] proposes updating the policy by optimizing a surrogate objective, constrained by the KL-divergence between the previous policy and the updated policy. Actor Critic using Kronecker-Factored Trust Region (ACKTR) [28] uses Kronecker-factored approximated curvature to perform a similar trust-region update. Proximal Policy Optimization (PPO) [16] proposes to match the performance of TRPO with only first-order optimization.
Soft Actor-Critic (SAC) [6] is an off-policy algorithm that has both the qualities of sample efficiency and stability. SAC optimizes a stochastic policy with entropy regularization. The policy is trained to maximize a trade-off between expected return and entropy. Increasing entropy results in more exploration, which can help accelerate learning. It can also prevent the policy from converging to a local optimum.
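For concreteness, the entropy-regularized objective that SAC maximizes can be written in the standard form below (our restatement of the usual formulation, not an equation from this paper), where \(\alpha\) is the entropy regularization coefficient revisited in Section 6.2:
\[
J(\pi) = \mathbb{E}_{\tau \sim \pi}\left[\sum_{t=0}^{\infty} \gamma^{t} \Big( r(s_t, a_t) + \alpha\, \mathcal{H}\big(\pi(\cdot \mid s_t)\big) \Big)\right],
\qquad
\mathcal{H}\big(\pi(\cdot \mid s)\big) = \mathbb{E}_{a \sim \pi(\cdot \mid s)}\big[-\log \pi(a \mid s)\big].
\]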
Our contribution In this work, we propose Soft Actor Retrospective Critic (SARC), an algorithm that improves SAC by applying a retrospective regulariser on the critic, inspired by retrospective loss [8]. We empirically validate this claim by comparing and analysing SARC against SAC, TD3 and DDPG on multiple continuous control environments.

3 Soft Actor Retrospective Critic

3.1 Premise

In actor-critic methods, we note that the actor learning algorithm incrementally improves its policy using the critic's performance estimate, while the best achievable performance is unknown. In contrast, the critic learning algorithm is a supervised learning algorithm, albeit with a moving target. This is the key observation that enables us to import retrospective loss from supervised learning to actor-critic methods.
Retrospective loss and critic learning We begin by describing our notation. Let \((x_i, y_i)\) denote an input and output (ground truth) training data point, with \(N\) denoting the size of the training data and \(x_i, y_i\) belonging to the spaces \(\mathcal{X}\) and \(\mathcal{Y}\) respectively, where \(d_{\mathcal{X}}, d_{\mathcal{Y}}\) are the respective dimensions of the input and output. Furthermore, let \(f_{\theta}: \mathcal{X} \rightarrow \mathcal{Y}\) denote a mapping from the input space to the output space, parameterized by \(\theta \in \mathbb{R}^{p}\), where \(p\) denotes the number of parameters in this mapping. Let \(\hat{y}_i = f_{\theta}(x_i)\) denote the predicted output for the data point \(x_i\) using parameter \(\theta\). The supervised learning objective is to learn a parameter \(\theta^{*}\), which minimizes the error between the ground truth output \(y_i\) and the predicted output \(\hat{y}_i\) for all \(i\). Various metrics can be used to represent this loss, but for our illustration, we choose a loss based on the \(L_2\) metric. We call this loss the original loss, \(\mathcal{L}_{\text{original}}\), which is defined as:
\[
\mathcal{L}_{\text{original}}(\theta) = \sum_{i=1}^{N} \left\| f_{\theta}(x_i) - y_i \right\|_{2} . \tag{1}
\]
We paraphrase the definition of retrospective loss from [8], and modify it slightly as follows:
Definition 1. [adapted from [8]] Retrospective loss is defined in the context of an iterative parameter estimation scheme and is a function of the parameter values at two different iterations (times): the current iteration and a previous iteration. Let \(t\) and \(t_p\) denote the current and previous times respectively, so that \(t_p < t\). Then, the retrospective loss at time \(t\), \(\mathcal{L}^{t}_{\text{retro}}\), is comprised of two components, the original loss and the retrospective regularizer, as given below:
\[
\mathcal{L}^{t}_{\text{retro}}(\theta^{t}) = \mathcal{L}_{\text{original}}(\theta^{t}) + \mathcal{L}^{t}_{\text{retro-reg}}(\theta^{t}), \tag{2}
\]
where the retrospective regularizer is defined as:
\[
\mathcal{L}^{t}_{\text{retro-reg}}(\theta^{t}) = \sum_{i=1}^{N} \Big( \kappa \left\| f_{\theta^{t}}(x_i) - y_i \right\| - \left\| f_{\theta^{t}}(x_i) - f_{\theta^{t_p}}(x_i) \right\| \Big),
\]
where \(\kappa > 0\) and \(\|\cdot\|\) is any metric on vector spaces, such as \(L_2\). We term the family of functions as given in (2), defined by the parameters \(\kappa, t_p\) and the metric \(\|\cdot\|\), as retrospective loss functions.
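As an illustration, a minimal PyTorch-style sketch of one supervised training step with this regularizer might look as follows; the function and variable names, the choice of the \(L_2\) metric and the value of kappa are our own illustrative assumptions, not the reference implementation of [8].

    import torch

    def retrospective_step(model, prev_model, optimizer, x, y, kappa=2.0):
        # model: current network f_theta; prev_model: frozen copy of the network
        # from a previous training step (e.g. via copy.deepcopy).
        # x: (batch, d_in), y: (batch, d_out).
        y_hat = model(x)                          # prediction with current parameters
        with torch.no_grad():
            y_prev = prev_model(x)                # prediction with previous parameters

        original = (y_hat - y).norm(dim=-1).mean()        # original L2 loss, as in (1)
        # Retrospective regularizer: pull predictions towards the ground truth
        # and push them away from the previous step's predictions.
        retro = kappa * (y_hat - y).norm(dim=-1).mean() \
                - (y_hat - y_prev).norm(dim=-1).mean()

        loss = original + retro                   # retrospective loss, as in (2)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item()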
When retrospective loss is minimized along with the task-specific loss, the parameters at the current training step are guided towards the optimal parameters while being pulled away from the parameters at a previous training step. In the supervised setting, retrospective loss results in better test accuracy.
We posit that accelerating critic learning can help the actor realise better gradient estimates, which results in improved policy learning. In line with this goal, we propose that using the retrospective loss as a regularizer in the critic objective will accelerate critic learning. In Section 5, we present experiments that demonstrate the improved policy learning for the actor achieved with the use of retrospective loss as a regularizer in the critic objective. Furthermore, in Section 6 we observe that even for slowly moving targets (ground truths), as is the case for the critic, retrospective loss yields convergence acceleration.

3.2 Formalising SARC

As the popular Soft Actor-Critic (SAC) algorithm offers both the qualities of sample-efficient learning and stability [6] (Section 2), we specifically extend SAC in this work. In principle, retrospective regularization can be incorporated into other actor-critic methods as well and would be an interesting direction for future exploration. Our algorithm is summarized in Algorithm 1, where the basic SAC code is reproduced from SpinningUp [1] and the additions are highlighted in blue. From an algorithmic perspective, SARC is a minimal modification (3 lines) over SAC.
Learning in SARC Here we reproduce the basic SAC actor and critic learning algorithms [6, 1] with the retrospective regularizer added to the critic loss, which yields the SARC algorithm. The update equations are as given below. The actor is represented by a function \(\pi_{\theta}\), parametrized by \(\theta\). Given any state \(s\), the action \(\tilde{a}_{\theta}(s)\) is sampled as:
\[
\tilde{a}_{\theta}(s) = \tanh\!\left( \mu_{\theta}(s) + \sigma_{\theta}(s) \odot \xi \right), \qquad \xi \sim \mathcal{N}(0, I).
\]
The targets for the two critics in SARC are then computed as in SAC:
\[
y(r, s^{\prime}, d) = r + \gamma (1-d) \left( \min_{i=1,2} Q_{\phi_{\text{targ},i}}(s^{\prime}, \tilde{a}^{\prime}) - \alpha \log \pi_{\theta}(\tilde{a}^{\prime} \mid s^{\prime}) \right), \qquad \tilde{a}^{\prime} \sim \pi_{\theta}(\cdot \mid s^{\prime}), \tag{3}
\]
where \(y\) is the target, \(r\) is the per-step reward, \(d\) is the done signal, \(\phi_{1}, \phi_{2}\) are the critic parameters, and \(\phi_{\text{targ},1}, \phi_{\text{targ},2}\) are the corresponding critic parameters for the target. The loss for each critic is then estimated as:
\[
L(\phi_{i}, \mathcal{D}) = \mathbb{E}_{(s, a, r, s^{\prime}, d) \sim \mathcal{D}} \Big[ \big( Q_{\phi_{i}}(s, a) - y(r, s^{\prime}, d) \big)^{2} + \kappa \big\| Q_{\phi_{i}}(s, a) - y(r, s^{\prime}, d) \big\| - \big\| Q_{\phi_{i}}(s, a) - Q_{\phi_{\text{prev},i}}(s, a) \big\| \Big], \qquad i \in \{1, 2\}, \tag{4}
\]
where \(y\) is given by (3), \(\phi_{\text{prev},1}, \phi_{\text{prev},2}\) are the critic parameters from a previous update step, and the last two terms inside the expectation form the retrospective regularizer of (2) with the target \(y\) playing the role of the ground truth. And finally, the actor is updated as in the original SAC algorithm.
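A minimal sketch of how a critic update along the lines of (3) and (4) could be implemented on top of a SpinningUp-style SAC critic is given below; the network interfaces, batch layout and the coefficient kappa are illustrative assumptions rather than the exact reference code.

    import torch
    import torch.nn.functional as F

    def sarc_critic_loss(q1, q2, q1_prev, q2_prev, q1_targ, q2_targ, policy,
                         batch, gamma=0.99, alpha=0.2, kappa=2.0):
        s, a, r, s2, d = batch['obs'], batch['act'], batch['rew'], batch['obs2'], batch['done']

        with torch.no_grad():
            # SAC target (3): clipped double-Q value with entropy bonus.
            a2, logp_a2 = policy(s2)              # reparametrized sample and its log-prob
            q_min = torch.min(q1_targ(s2, a2), q2_targ(s2, a2))
            y = r + gamma * (1 - d) * (q_min - alpha * logp_a2)
            # Predictions of the previous-step critics, used by the regularizer.
            q1_prev_val, q2_prev_val = q1_prev(s, a), q2_prev(s, a)

        loss = 0.0
        for q, q_prev_val in ((q1, q1_prev_val), (q2, q2_prev_val)):
            q_val = q(s, a)
            bellman = F.mse_loss(q_val, y)        # original SAC critic loss
            # Retrospective regularizer of (4): towards the target, away from
            # the previous critic's prediction for the same (s, a).
            retro = kappa * (q_val - y).abs().mean() - (q_val - q_prev_val).abs().mean()
            loss = loss + bellman + retro
        return loss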

4 Experimental Setup

To demonstrate that SARC outperforms SAC and is competitive with existing state-of-the-art actor-critic methods such as TD3 and DDPG, we conduct an exhaustive set of experiments. For all of our
Algorithm 1 Soft Actor Retrospective Critic
    Input: initial policy parameters \(\theta\), Q-function parameters \(\phi_{1}, \phi_{2}\), empty replay buffer \(\mathcal{D}\)
    Set target parameters equal to main parameters: \(\phi_{\text{targ}, 1} \leftarrow \phi_{1}, \phi_{\text{targ}, 2} \leftarrow \phi_{2}\)
    Set previous Q-function parameters to the initial parameters: \(\phi_{\text{prev}, 1} \leftarrow \phi_{1}, \phi_{\text{prev}, 2} \leftarrow \phi_{2}\)
    repeat
        Observe state \(s\), sample \(a \sim \pi_{\theta}(\cdot \mid s)\)
        Execute \(a\) in the environment
        Observe next state \(s^{\prime}\), reward \(r\), and done signal \(d\) to indicate whether \(s^{\prime}\) is terminal
        Store \(\left(s, a, s^{\prime}, r, d\right)\) in replay buffer \(\mathcal{D}\)
        If \(d=1\) then \(s^{\prime}\) is terminal; reset the environment
        if it's time to update then
            for \(j\) in range numUpdates do
                Sample a batch \(B=\left\{\left(s, a, s^{\prime}, r, d\right)\right\}\) from \(\mathcal{D}\)
                Compute targets \(y\left(r, s^{\prime}, d\right)\) for the Q-functions as in (3)
                Update Q-functions by one step of gradient descent on the retrospective critic loss (4), estimated on \(B\) using the previous Q-function parameters \(\phi_{\text{prev}, 1}, \phi_{\text{prev}, 2}\)
                Update policy by one step of gradient ascent using \(\nabla_{\theta} \frac{1}{|B|} \sum_{s \in B}\left(\min_{i=1,2} Q_{\phi_{i}}\left(s, \tilde{a}_{\theta}(s)\right)-\alpha \log \pi_{\theta}\left(\tilde{a}_{\theta}(s) \mid s\right)\right)\), where \(\tilde{a}_{\theta}(s)\) is a sample from \(\pi_{\theta}(\cdot \mid s)\) which is differentiable wrt \(\theta\) via the reparametrization trick
                Update target Q-functions: \(\phi_{\text{targ}, i} \leftarrow \rho \phi_{\text{targ}, i}+(1-\rho) \phi_{i}\), \(i=1,2\)
                Update previous Q-functions: \(\phi_{\text{prev}, i} \leftarrow \phi_{i}\), \(i=1,2\)
            end for
        end if
    until convergence
(Figure 1 panels: (a) Walker-Stand, (b) Walker-Walk, (c) Finger-Spin, (d) Cheetah-Run, (e) Reacher-Easy, (f) Reacher-Hard.)
Figure 1: Results for SARC (red), SAC (green), TD3 (purple) and DDPG (blue) across 6 DeepMind Control Suite tasks. The x-axis shows the timesteps in each environment. The y-axis shows the average value and standard deviation band of return across 5 seeds (10 test episode returns per seed). It can be observed that SARC improves over SAC in all environments. SARC also outperforms TD3 and DDPG in most environments.
experiments, we use the implementations provided in SpinningUp [1]. SpinningUp is an open-source deep RL library.
We modify the original implementation of SAC, provided in SpinningUp, to SARC. Modifying an existing SAC implementation to SARC is minimal and straightforward, requiring the addition of a few lines and essentially no extra compute, as presented in Algorithm 1. Within the SARC critic loss, we use the same setting of the retrospective regularizer across all tasks, environments and analyses.
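For instance, assuming a SpinningUp-style actor-critic container ac with critics q1 and q2 (a sketch under that assumption, not the exact reference code), the extra bookkeeping reduces to maintaining and refreshing a frozen copy of the critics:

    import copy

    def make_prev_critics(ac):
        # Create frozen copies of the critics for the retrospective regularizer.
        ac_prev = copy.deepcopy(ac)
        for p in ac_prev.parameters():
            p.requires_grad = False
        return ac_prev

    def refresh_prev_critics(ac, ac_prev):
        # After a critic update, remember the current critics as "previous".
        ac_prev.q1.load_state_dict(ac.q1.state_dict())
        ac_prev.q2.load_state_dict(ac.q2.state_dict())

The remaining change is the retrospective term added to the critic loss, as in the critic-loss sketch following (4) in Section 3.2.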
We conduct our evaluations and analysis across various continuous control tasks. Concretely, we use 6 tasks provided in the DeepMind Control Suite [20] and 5 tasks provided in PyBullet Environments [4]. For all experiments, we retain all of the default hyperparameters specified originally in the SpinningUp library and let each agent train for 2 million timesteps in the given environment. In our experiments, the same hyperparameters worked for all environments and tasks. The actor and critic network size used in all experiments is [256, 256], unless specified otherwise. This is the default value for the network size specified originally in the SpinningUp library. All models were trained using CPUs on a Linux machine running Ubuntu 16.04 with 2nd Generation Intel Xeon Gold Processor.
In Section 5, we compare our proposed algorithm SARC to SAC, TD3 and DDPG across 6 tasks provided in the DeepMind Control Suite. In Section 6, we compare retrospective loss with an alternate baseline technique for critic acceleration and analyse our hypothesis of faster critic convergence. We also compare SARC with SAC under various modified experimental settings to further establish our claim. We make all of our code, as well as all of the data used to construct the graphs and report the figures, available at: https://github.com/sukritiverma1996/SARC

5 Results

In this section, we present comparisons between SARC, SAC, TD3 and DDPG. We perform these evaluations on 6 DeepMind Control Suite tasks: Cheetah-Run, Finger-Spin, Walker-Walk, Walker-Stand, Reacher-Easy, Reacher-Hard. To ensure fairness and consistency, and to rule out outlier behavior,
Table 1: Test episode returns of SARC, SAC, TD3 and DDPG on 6 DeepMind Control Suite tasks after training for 2 million steps.
Environment Average Test Episode Return
TD3 DDPG SAC SARC
Walker-Stand 959.48 716.99 697.33
Walker-Walk 797.87 770.08 886.28
Finger-Spin 827.02 179.3 315.28
Cheetah-Run 500.52 8.22 382.69
Reacher-Easy 799.62 913.92
Reacher-Hard 771.06 497.4
we run five trials for each experiment presented in this paper, each with a different preset random seed. Each experiment uses the same set of 5 seeds.
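The statistic reported below and in the figures can be computed as in the following sketch, assuming a Gym-style env and an agent with a deterministic act method (both names are illustrative):

    import numpy as np

    def test_return(env, agent, episodes=10):
        # Average undiscounted return over a fixed number of test episodes.
        returns = []
        for _ in range(episodes):
            obs, done, ep_ret = env.reset(), False, 0.0
            while not done:
                obs, rew, done, _ = env.step(agent.act(obs))
                ep_ret += rew
            returns.append(ep_ret)
        return float(np.mean(returns))

    # One such value is computed per seed; the tables and curves report the mean
    # (and, for the curves, the standard deviation band) across the 5 seeds.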
Table 1 shows the average value across 5 seeds (10 test episode returns per seed) obtained for each of the 6 tasks after training for 2 million timesteps. It can be observed that, at the end of training, SARC outperforms or is at par with existing approaches on 5 out of 6 environments (best values presented in bold).
In Figure 1, we present the mean Monte Carlo returns over 10 test episodes at various steps during training. The x-axis shows the timesteps in each environment and the y-axis shows the mean and standard deviation band of the above-specified returns across 5 seeds.
It can be observed that, on Finger-Spin, Walker-Stand and Walker-Walk, SARC achieves a higher return value faster than any other algorithm. On Reacher-Easy and Reacher-Hard, SARC overlaps with SAC and continues to provide improvement over TD3 and DDPG. SARC substantially improves SAC, particularly on Walker-Stand, Walker-Walk, Cheetah-Run, and Finger-Spin. To ensure and demonstrate that SARC provides a consistent improvement over SAC even under modified experimental settings, we restrict further analyses to these 4 environments. The experiments as described in this section have also been performed on 5 tasks provided in PyBullet Environments [4]. These can be found in Appendix A.

6 Discussion

First, we explore the effect of an alternate baseline strategy to accelerate critic convergence in SAC and compare it with SARC. Next, we present evidence of accelerated critic convergence in SARC. Finally, we demonstrate that SARC significantly outperforms SAC for increased network sizes. We note that, for ease of exposition, we restrict the comparative scope of this analysis to SAC, which is the most similar baseline and also learns a soft critic.
(Figure 2 panels: (a) Finger-Spin, (b) Cheetah-Run, (c) Walker-Stand.)
Figure 2: Comparing SARC and delayed SAC. The x-axis shows the timesteps in each environment. The y-axis shows the average value and standard deviation band of return across 5 seeds (10 test episode returns per seed). Compared to delayed SAC (yellow), SARC (red) provides a more reliable and potent improvement over SAC.

(Figure 3 panels: Finger-Spin, Walker-Stand and Walker-Walk, each shown for two settings of the entropy regularization coefficient.)
Figure 3: Results for SARC (red) and SAC on tuning the entropy regularization coefficient, \(\alpha\). The x-axis shows the timesteps in each environment. The y-axis shows the average value and standard deviation band of return across 5 seeds (10 test episode returns per seed). It can be seen that SARC outperforms SAC across different values of \(\alpha\).

6.1 Increasing critic update frequency for SAC

An alternate baseline strategy to accelerate convergence of the critic relative to the actor can be to update the critic more frequently. A similar idea has been proposed in TD3 [5], where the critic updates more frequently than the actor.
In this section, we compare SARC against a version of SAC where the critic updates twice as frequently as the actor. We term this "delayed SAC". We do this for three environments: Finger-Spin, Cheetah-Run and Walker-Stand. Figure 2 has the results of this investigation. The x-axis shows the timesteps in each environment and the y-axis shows the mean and standard deviation band of mean Monte Carlo returns over 10 test episodes across 5 seeds. It can be observed that while delayed SAC provides some improvement over the original SAC, it is not as reliable and not as potent as the improvement provided by SARC.
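A sketch of this baseline's update loop is given below; update_critic and update_actor are assumed callables that perform one gradient step on the critic (with its target networks) and on the actor respectively, which is a hypothetical split of the fused SpinningUp update.

    def delayed_sac_update(replay_buffer, update_critic, update_actor,
                           num_updates, batch_size, critic_steps_per_actor_step=2):
        # critic_steps_per_actor_step = 1 recovers the standard SAC update loop.
        for _ in range(num_updates):
            batch = replay_buffer.sample_batch(batch_size)
            for _ in range(critic_steps_per_actor_step):
                update_critic(batch)
            update_actor(batch)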

6.2 Modifying entropy regularization coefficient

In this section, we explore the effect of modifying the entropy regularization coefficient, \(\alpha\), in SAC and SARC. It may be the case that the significant performance gain that SARC offers over SAC disappears if the entropy regularization coefficient \(\alpha\) is tuned.
All experiments presented in Section 5 used the default value of \(\alpha\). In this section, we look at two additional values of \(\alpha\) (one of which is 0.4). All other hyperparameters are held at the fixed values described in Section 4. We do this for three environments: Finger-Spin, Walker-Stand and Walker-Walk. Figure 3 has the results of this investigation. Mean Monte Carlo returns over 10 test episodes at various steps during training have been plotted. The x-axis shows the timesteps in each environment and the y-axis shows the mean and standard deviation band of the above-specified returns across 5 seeds. It can be observed that SARC continues to provide consistent improvement over SAC across values of \(\alpha\).

6.3 Tracking SARC critic convergence with respect to Monte Carlo policy evaluation

We train one SAC agent and one SARC agent for 2 million timesteps. During training, we save the intermediate actor and critic every 10,000 timesteps. Given an intermediate actor-critic pair, we compute the mean squared difference between the Monte Carlo estimate of the return of the current policy and the value learned by the critic for that policy. This difference is termed the "Q-Error". If

(Figure 4 panels: (a) Finger-Spin, (b) Cheetah-Run, (c) Walker-Stand.)
Figure 4: Tracking critic convergence with respect to Monte Carlo policy evaluation across training. The x-axis shows the timesteps in each environment. The y-axis shows the mean and standard deviation band of "Q-Error" across 5 seeds. It can be seen that the SARC (red) critic converges faster.
the critic value function has converged to the correct value for a given policy, then this difference should be zero. The idea is to track and compare how fast the critic converges.
We do this for three environments: Finger-Spin, Cheetah-Run, Walker-Stand. Figure 4 has the results of this investigation. The average "Q-Error" over 5 seeds has been plotted. The graphs also show the standard deviation band across the 5 random seeds. It can be seen that the SARC critic converges faster than the SAC critic on Finger-Spin and Walker-Stand.
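A sketch of how this "Q-Error" can be estimated for a saved actor-critic pair is shown below, assuming a Gym-style env, an actor that maps an observation tensor to an action tensor, and a critic that maps an observation-action pair to a scalar value; the discount gamma and the omission of the entropy term are simplifying assumptions of this illustration.

    import numpy as np
    import torch

    def q_error(env, actor, critic, episodes=10, gamma=0.99):
        # Mean squared difference between the discounted Monte Carlo return of the
        # current policy and the critic's prediction for each visited (s, a) pair.
        errors = []
        for _ in range(episodes):
            obs, done, traj = env.reset(), False, []
            while not done:
                act = actor(torch.as_tensor(obs, dtype=torch.float32))
                next_obs, rew, done, _ = env.step(act.detach().numpy())
                traj.append((obs, act, rew))
                obs = next_obs
            ret = 0.0
            for obs_t, act_t, rew_t in reversed(traj):
                ret = rew_t + gamma * ret      # discounted Monte Carlo return from (s, a)
                q_val = critic(torch.as_tensor(obs_t, dtype=torch.float32), act_t).item()
                errors.append((q_val - ret) ** 2)
        return float(np.mean(errors))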

6.4 Increasing actor and critic network complexity

It may be possible that SAC can perform as well as SARC with a bigger network. In this section, we compare SAC and SARC on tasks from the DeepMind Control Suite after increasing the network sizes for the actor and critic to [400, 300]. This is another commonly found architecture in the literature relating to continuous control [7]. In our previous experiments, we used a network size of [256, 256].
In Figure 5, we present the mean Monte Carlo returns over 10 test episodes at various steps during training. The x-axis shows the timesteps in each environment and the y-axis shows the mean and standard deviation band of the above-specified returns across 5 seeds. It can be observed that even on increasing the network sizes, SARC continues to provide consistent improvement in episode return value over SAC. The experiments described in this section have also been performed on 5 tasks provided in PyBullet Environments [4]. These can be found in Appendix B.

7 Conclusion and Future Work

In this work, we presented Soft Actor Retrospective Critic (SARC). SARC builds on and improves SAC by applying a retrospective regulariser on the critic. This retrospective regulariser is inspired by retrospective loss, a recent technique that improves accuracy in the supervised setting. We proposed that using the retrospective loss as a regularizer in the critic objective accelerates critic learning. Due to the two-time-scale nature of actor-critic algorithms, accelerating critic convergence can provide a better gradient estimate for the actor throughout training and improve final performance. We validated our claims of improvement in final performance through extensive experimental evaluation and analysis on multiple continuous control environments. We also compared retrospective loss with an alternate baseline technique for critic acceleration and presented evidence of critic acceleration in SARC. We also tuned the SAC entropy regularization coefficient \(\alpha\) to demonstrate that SARC continues to provide performance gains across different values of \(\alpha\). As a future direction, we aim to extend retrospective regularization to other actor-critic algorithms.

References

[1] Joshua Achiam. Spinning Up in Deep Reinforcement Learning. 2018.
[2] Dzmitry Bahdanau, Philemon Brakel, Kelvin Xu, Anirudh Goyal, Ryan Lowe, Joelle Pineau, Aaron Courville, and Yoshua Bengio. An actor-critic algorithm for sequence prediction. arXiv preprint arXiv:1607.07086, 2016.
(Figure 5 panels: (a) Walker-Stand, (b) Walker-Walk, (c) Finger-Spin, (d) Cheetah-Run, (e) Reacher-Easy, (f) Reacher-Hard.)
Figure 5: Even after increasing network size to [400, 300], SARC (red) continues to provide consistent gains over SAC. The x-axis shows the timesteps in each environment. The y-axis shows the average value and standard deviation band of return across 5 seeds (10 test episode returns per seed).
[3] Vivek S Borkar and Vijaymohan R Konda. The actor-critic algorithm as multi-time-scale stochastic approximation. Sadhana, 22(4):525-543, 1997.
[4] Erwin Coumans and Yunfei Bai. Pybullet, a python module for physics simulation for games, robotics and machine learning. 2016.
[5] Scott Fujimoto, Herke Hoof, and David Meger. Addressing function approximation error in actor-critic methods. In International Conference on Machine Learning, pages 1587-1596. PMLR, 2018.
[6] Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine. Soft actor-critic: Offpolicy maximum entropy deep reinforcement learning with a stochastic actor. In International Conference on Machine Learning, pages 1861-1870. PMLR, 2018.
[7] Peter Henderson, Riashat Islam, Philip Bachman, Joelle Pineau, Doina Precup, and David Meger. Deep reinforcement learning that matters. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 32, 2018.
[8] Surgan Jandial, Ayush Chopra, Mausoom Sarkar, Piyush Gupta, Balaji Krishnamurthy, and Vineeth Balasubramanian. Retrospective loss: Looking back to improve training of deep neural networks. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 1123-1131, 2020.
[9] Chi Jin, Zeyuan Allen-Zhu, Sebastien Bubeck, and Michael I Jordan. Is q-learning provably efficient? arXiv preprint arXiv:1807.03765, 2018.
[10] Vijay R Konda and John N Tsitsiklis. Actor-critic algorithms. In Advances in neural information processing systems, pages 1008-1014. Citeseer, 2000.
[11] Timothy P Lillicrap, Jonathan J Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971, 2015.
[12] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller. Playing atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602, 2013.
[13] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. nature, 518(7540):529-533, 2015.
[14] Volodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza, Alex Graves, Timothy Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. Asynchronous methods for deep reinforcement learning. In International conference on machine learning, pages 1928-1937. PMLR, 2016.
[15] John Schulman, Sergey Levine, Pieter Abbeel, Michael Jordan, and Philipp Moritz. Trust region policy optimization. In International conference on machine learning, pages 1889-1897. PMLR, 2015.
[16] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
[17] David Silver, Guy Lever, Nicolas Heess, Thomas Degris, Daan Wierstra, and Martin Riedmiller. Deterministic policy gradient algorithms. In International conference on machine learning, pages 387-395. PMLR, 2014.
[18] Richard S Sutton and Andrew G Barto. Reinforcement learning: An introduction. MIT Press, 2018.
[19] Richard S Sutton, David A McAllester, Satinder P Singh, Yishay Mansour, et al. Policy gradient methods for reinforcement learning with function approximation. In NIPs, volume 99, pages 1057-1063. Citeseer, 1999.
[20] Yuval Tassa, Yotam Doron, Alistair Muldal, Tom Erez, Yazhe Li, Diego de Las Casas, David Budden, Abbas Abdolmaleki, Josh Merel, Andrew Lefrancq, et al. Deepmind control suite. arXiv preprint arXiv:1801.00690, 2018.
[21] Sebastian Thrun and Anton Schwartz. Issues in using function approximation for reinforcement learning. In Proceedings of the Fourth Connectionist Models Summer School, pages 255-263. Hillsdale, NJ, 1993.
[22] E. Todorov, T. Erez, and Y. Tassa. MuJoCo: A physics engine for model-based control. In 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 5026-5033, 2012. doi: 10.1109/IROS.2012.6386109.
[23] Hado Van Hasselt, Arthur Guez, and David Silver. Deep reinforcement learning with double q-learning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 30, 2016.
[24] Ziyu Wang, Victor Bapst, Nicolas Heess, Volodymyr Mnih, Remi Munos, Koray Kavukcuoglu, and Nando de Freitas. Sample efficient actor-critic with experience replay. arXiv preprint arXiv:1611.01224, 2016.
[25] Marco A Wiering. Convergence and divergence in standard and averaging reinforcement learning. In European Conference on Machine Learning, pages 477-488. Springer, 2004.
[26] Ronald J Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine learning, 8(3-4):229-256, 1992.
[27] Yue Wu, Weitong Zhang, Pan Xu, and Quanquan Gu. A finite time analysis of two time-scale actor critic methods. arXiv preprint arXiv:2005.01350, 2020.
[28] Yuhuai Wu, Elman Mansimov, Shun Liao, Roger Grosse, and Jimmy Ba. Scalable trust-region method for deep reinforcement learning using kronecker-factored approximation. arXiv preprint arXiv:1708.05144, 2017.

A PyBullet Results for Section 5: Results

We have demonstrated SARC using tasks from the DeepMind Control Suite which uses the MuJoCo Physics Engine [22]. In this section, we present comparisons on a set of environments that use the Bullet Physics Engine [4], an open-source physics simulator. We perform these evaluations on 5 PyBullet [4] environments: Ant, HalfCheetah, Hopper, Reacher, Walker-2D.
Table 2 shows the average value across 5 seeds ( 10 test episode returns per seed) obtained for each of the 5 tasks after training for 2 million timesteps. It can be observed that, at the end of training, SARC outperforms or is at par with existing approaches on 4 out of 5 environments presented in bold.
In Figure 6, we present the mean Monte Carlo returns over 10 test episodes at various steps during training. The x-axis shows the timesteps in each environment and the y-axis shows the mean and standard deviation band of the above-specified returns across 5 seeds.
It can be observed that, on Ant, Half-Cheetah and Walker-2D, SARC achieves a higher return value faster than SAC. On Hopper and Reacher, SARC overlaps with and does not degrade the performance of SAC. Even on changing the underlying physics simulator, SARC continues providing consistent improvement over SAC.
We also hope that using an open-source physics simulator will help with reproducibility and reach of our work.

B PyBullet Results for Section 6.4: Increasing actor and critic network complexity

Figure 7 shows results for SARC and SAC on PyBullet Environments after increasing the actor and critic network sizes to [400, 300]. Even after increasing the network size to [400, 300], SARC continues to provide consistent gains over SAC.
Table 2: Test episode returns of SARC and SAC on 5 PyBullet Environments after training for 2 million steps.
Environment Avg Test Ep Return
SAC SARC
Ant 1581.59
Half Cheetah 1701.19
Hopper 1975.58
Reacher
Walker-2D 1093.64
(Figure 6 panels: Ant, Hopper, Half Cheetah, Reacher, Walker-2D.)
Figure 6: Results for SARC (red) and SAC on 5 PyBullet Environments. The x-axis shows the timesteps in each environment. The y-axis shows the average value and standard deviation band of return across 5 seeds (10 test episode returns per seed). It can be observed that SARC improves over SAC in most environments even after changing the underlying physics engine.
(Figure 7 panels: (a) AntBulletEnv-v0, (b) HalfCheetahBulletEnv-v0, (c) HopperBulletEnv-v0, (d) ReacherBulletEnv-v0, (e) Walker2DBulletEnv-v0.)
Figure 7: Results for SARC (red) and SAC on PyBullet Environments after increasing the actor and critic network sizes to [400, 300]. Even after increasing the network size to [400, 300], SARC (red) continues to provide consistent gains over SAC. The x-axis shows the timesteps in each environment. The y-axis shows the average value and standard deviation band of return across 5 seeds (10 test episode returns per seed).

  1. Work done while at Adobe