
In many reinforcement learning papers, it is stated that for estimating the value function, one of the advantages of temporal difference (TD) methods over Monte Carlo methods is that they have lower variance in their value estimates. I have not been able to find any formal proof of this.

Moreover, it is also said that the Monte Carlo method is less biased when compared with TD methods. If somebody can help me better understand this phenomenon, I would appreciate it.


1 Answer


The difference between the algorithms is how they set a new value target based on experience.

Using action values to make it a little more concrete, and sticking with on-policy evaluation (not control) to keep the argument simple, the update to estimate $Q(S_t,A_t)$ takes the same general form for both TD and Monte Carlo:

$$Q(S_t,A_t) \leftarrow Q(S_t,A_t) + \alpha\,\bigl(X - Q(S_t,A_t)\bigr)$$

where $X$ is an estimate of the true value of $Q(S_t,A_t)$ gained through some experience. When discussing whether a value-based RL technique is biased or has high variance, the part that has these traits is whatever stands in for $X$ in this update.
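
A minimal sketch of that shared form, assuming a tabular setting where `Q` is a plain Python dict keyed by `(state, action)` pairs and `target` stands in for whatever plays the role of $X$:

```python
# Shared tabular update used by both TD and Monte Carlo evaluation.
# Q maps (state, action) -> current value estimate; alpha is the step size.
def update(Q, state, action, target, alpha=0.1):
    Q[(state, action)] += alpha * (target - Q[(state, action)])
```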

For Monte Carlo techniques, the value of $X$ is estimated by following a sample trajectory starting from $(S_t,A_t)$ and adding up the discounted rewards to the end of the episode (at time $\tau$):

$$\sum_{k=0}^{\tau-t-1} \gamma^k R_{t+k+1}$$

For temporal difference learning, the value of $X$ is estimated from a single step of sampled reward, bootstrapping the estimate using the Bellman equation for $q_\pi(s,a)$:

$$R_{t+1} + \gamma Q(S_{t+1},A_{t+1})$$
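
To make the two targets concrete, here is a small sketch (the list-of-rewards episode format and the dict `Q` are assumptions for illustration; `rewards[t]` is taken to be $R_{t+1}$, the reward received after the action at time $t$):

```python
# Monte Carlo target: discounted sum of all rewards from time t to the end of the episode.
def mc_target(rewards, t, gamma=0.99):
    return sum(gamma ** k * r for k, r in enumerate(rewards[t:]))

# TD(0) target: one sampled reward plus the bootstrapped estimate of the next state-action value.
def td_target(reward, next_state, next_action, Q, gamma=0.99):
    return reward + gamma * Q[(next_state, next_action)]
```

Either value can then be plugged in as `target` in the shared `update` sketch above.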

Bias

In both cases, the true value function that we want to estimate is the expected return after taking action $A_t$ in state $S_t$ and following policy $\pi$ thereafter. This can be written:

$$q_{\pi}(s,a) = \mathbb{E}_{\pi}\!\left[\sum_{k=0}^{\tau-t-1} \gamma^k R_{t+k+1} \,\middle|\, S_t = s, A_t = a\right]$$

That should look familiar. The Monte Carlo target for $X$ is clearly a direct sample of this value and has the same expected value as the quantity we are looking for. Hence it is unbiased.

TD learning, however, has a problem due to its initial value estimates. The bootstrap value $Q(S_{t+1},A_{t+1})$ is initially whatever you set it to, arbitrarily, at the start of learning. This has no bearing on the true value you are looking for, hence it is biased. Over time the bias decays exponentially as real values from experience are used in the update process. At least that is true for basic tabular forms of TD learning. When you add a neural network or other function approximation, this bias can cause stability problems, causing an RL agent to fail to learn.
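
As a toy illustration of that decay (the specific chain and numbers here are an illustrative assumption, not part of the argument above): take a deterministic two-state chain A → B → terminal with reward 1 per step and $\gamma = 1$, so the true values are $v(A)=2$ and $v(B)=1$. With state values initialised arbitrarily to 10, the first TD(0) updates actually pull $V(A)$ towards 11 because of the biased bootstrap, before the bias washes out:

```python
# Deterministic chain A -> B -> terminal, reward 1 per step, gamma = 1.
# True values: v(A) = 2, v(B) = 1. V is initialised to an arbitrary 10,
# which is exactly the source of the initial TD bias.
alpha, gamma = 0.1, 1.0
V = {"A": 10.0, "B": 10.0, "terminal": 0.0}

for episode in range(200):
    # TD(0) updates along the only possible trajectory A -> B -> terminal.
    V["A"] += alpha * (1.0 + gamma * V["B"] - V["A"])         # bootstraps on the biased V["B"]
    V["B"] += alpha * (1.0 + gamma * V["terminal"] - V["B"])  # terminal value is exactly 0
    if episode in (0, 9, 49, 199):
        print(f"episode {episode + 1:>3}: V(A) = {V['A']:.3f}, V(B) = {V['B']:.3f}")
```

By contrast, a Monte Carlo target from the very first episode would already be exactly 2 for A, since this toy chain is deterministic.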

Variance

Looking at how each update mechanism works, you can see that TD learning is exposed to three factors, each of which can vary (in principle, depending on the environment) over a single time step:

  • What reward $R_{t+1}$ will be returned.

  • What the next state $S_{t+1}$ will be.

  • Which action $A_{t+1}$ the policy will choose.

Each of these factors may increase the variance of the TD target value. However, the bootstrapping mechanism means that there is no direct variance from other events on the trajectory. That's it.

In comparison, the Monte Carlo return depends on every reward, state transition and policy decision from $(S_t,A_t)$ up to $S_\tau$. As this typically spans many time steps, the variance will also be a multiple of that seen in TD learning on the same problem. This is true even for deterministic environments with sparse deterministic rewards, because an exploring policy must be stochastic, which injects some variance at every time step involved in the sampled value estimate.
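
A quick empirical sketch of that multiple (the 10-step chain with i.i.d. uniform rewards and an already-accurate bootstrap value of 0 are illustrative assumptions):

```python
import random

# Toy episodic chain: 10 time steps, each giving an i.i.d. reward ~ Uniform(-1, 1),
# gamma = 1, so the true value of the start state is 0. The TD target bootstraps on
# an (assumed already accurate) estimate of 0 for the next state.
STEPS, N_SAMPLES = 10, 100_000

def sample_targets():
    rewards = [random.uniform(-1, 1) for _ in range(STEPS)]
    mc = sum(rewards)      # Monte Carlo target: every reward on the trajectory
    td = rewards[0] + 0.0  # TD(0) target: one reward plus the bootstrap
    return mc, td

def variance(xs):
    mean = sum(xs) / len(xs)
    return sum((x - mean) ** 2 for x in xs) / len(xs)

samples = [sample_targets() for _ in range(N_SAMPLES)]
print("MC target variance:", round(variance([s[0] for s in samples]), 3))  # roughly 10/3
print("TD target variance:", round(variance([s[1] for s in samples]), 3))  # roughly 1/3
```

With ten steps, the Monte Carlo target's variance comes out at roughly ten times the TD target's, matching the intuition that it scales with the number of time steps in the sampled trajectory.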

  • Thanks for your answer. It gives good intuition for the bias-variance trade-off between TD and Monte Carlo methods. However, I am looking for a slightly more formal mathematical proof, probably using convergence bounds.
    – Infintyyy
    Commented Jul 16, 2018 at 20:05
  • @Infintyyy: Although I am sure a more formal answer is possible, I don't think you will get as far as showing bounds, as it is possible to construct zero-bias TD results and low-variance Monte Carlo results through choice of environment plus policy. Bounds could possibly be derived analytically for concrete examples, but I don't know if that is possible for general RL.
    Commented Jul 17, 2018 at 13:00
