The difference between the algorithms is how they set a new value target based on experience.
Using action values to make this a little more concrete, and sticking with on-policy evaluation (not control) to keep the argument simple, the update to the estimate $Q(S_t, A_t)$ takes the same general form for both TD and Monte Carlo:
$$Q(S_t, A_t) = Q(S_t, A_t) + \alpha(X - Q(S_t, A_t))$$
where $X$ is an estimate of the true value of $Q(S_t, A_t)$ gained through some experience. When discussing whether a value-based RL technique is biased or has high variance, the part that has these traits is whatever stands for $X$ in this update.
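As a rough Python sketch of that generic tabular update (the names `q_table`, `update_q` and `alpha` are mine, not from the text):

```python
from collections import defaultdict

# Tabular action-value estimates Q(s, a), defaulting to an arbitrary initial value (0.0 here).
q_table = defaultdict(float)

def update_q(state, action, target, alpha=0.1):
    # Move Q(s, a) a fraction alpha towards the target X.
    # Monte Carlo and TD share this form; they differ only in how `target` is built.
    key = (state, action)
    q_table[key] += alpha * (target - q_table[key])
```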
For Monte Carlo techniques, the value of $X$ is estimated by following a sample trajectory starting from $(S_t, A_t)$ and adding up the rewards to the end of an episode (at time $\tau$):
$$\sum_{k=0}^{\tau-t-1} \gamma^k R_{t+k+1}$$
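In code, that target could be computed with something like this sketch, assuming the rewards observed from $t+1$ to $\tau$ have been collected in a list (`rewards` and `gamma` are my own names here):

```python
def monte_carlo_target(rewards, gamma=0.99):
    # Discounted sum of the sampled rewards R_{t+1} ... R_tau,
    # i.e. the Monte Carlo estimate X for Q(S_t, A_t).
    g = 0.0
    for r in reversed(rewards):  # work backwards so each step just discounts the running total
        g = r + gamma * g
    return g

# Example: three rewards observed after (S_t, A_t) until the end of the episode.
print(monte_carlo_target([0.0, 0.0, 1.0], gamma=0.9))  # 0.9**2 * 1.0 = 0.81
```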
For temporal difference, the value of $X$ is estimated by taking one step of sampled reward and bootstrapping the estimate using the Bellman equation for $q_\pi(s,a)$:
$$R_{t+1} + \gamma Q(S_{t+1}, A_{t+1})$$
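As a sketch, the corresponding one-step (SARSA-style) target, reusing the hypothetical `q_table` and `update_q` from the earlier snippet:

```python
def td_target(reward, next_state, next_action, gamma=0.99):
    # One-step TD estimate X = R_{t+1} + gamma * Q(S_{t+1}, A_{t+1}),
    # bootstrapping from the current estimate of the next state-action value.
    return reward + gamma * q_table[(next_state, next_action)]

# e.g. update_q(state, action, td_target(reward, next_state, next_action))
```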
Bias
In both cases, the true value function that we want to estimate is the expected return after taking the action $A_t$ in state $S_t$ and following policy $\pi$. This can be written:
$$q_\pi(s,a) = \mathbb{E}_\pi\Big[\sum_{k=0}^{\tau-t-1} \gamma^k R_{t+k+1} \,\Big|\, S_t = s, A_t = a\Big]$$
That should look familiar. The Monte Carlo target for $X$ is clearly a direct sample of this value and has the same expected value that we are looking for. Hence it is unbiased.
TD learning, however, has a problem due to its initial value estimates. The bootstrap value of $Q(S_{t+1}, A_{t+1})$ is initially whatever you set it to, arbitrarily, at the start of learning. This has no bearing on the true value you are looking for, hence it is biased. Over time, the bias decays exponentially as real values from experience are used in the update process. At least, that is true for basic tabular forms of TD learning. When you add a neural network or other function approximation, this bias can cause stability problems, and an RL agent may fail to learn.
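As a toy numeric example (my own numbers, not from the text): suppose every state-action pair has a true value of 10 under $\gamma = 0.9$ with a reward of 1 per step, but the table is initialised to 0. The very first TD target is then pulled towards that arbitrary initial value:

```python
# Hypothetical numbers, purely to illustrate the initialisation bias of the TD target.
gamma = 0.9
true_q = 10.0   # true value of every (s, a): 1 / (1 - gamma) with a reward of 1 per step
q_init = 0.0    # arbitrary initial estimate in the table
reward = 1.0    # the sampled one-step reward

td_target = reward + gamma * q_init
print(td_target)  # 1.0, far from the true 10.0, regardless of what experience produced
# As Q(S_{t+1}, A_{t+1}) is itself updated from real experience, this bias decays away.
```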
Variance
Looking at how each update mechanism works, you can see that TD learning is exposed to three factors that can each vary (in principle, depending on the environment) over a single time step:
What reward $R_{t+1}$ will be returned.
What the next state $S_{t+1}$ will be.
What the policy will choose for $A_{t+1}$.
Each of these factors may increase the variance of the TD target value. However, the bootstrapping mechanism means that there is no direct variance contribution from events further along the trajectory. That's it.
In comparison, the Monte Carlo return depends on every reward, state transition and policy decision from $(S_t, A_t)$ up to $S_\tau$. As this usually spans multiple time steps, the variance will also be a multiple of that seen in TD learning on the same problem. This is true even for deterministic environments with sparse, deterministic rewards, as an exploring policy must be stochastic, which injects some variance at every time step involved in the sampled value estimate.
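To make the comparison concrete, here is a rough, self-contained sketch on a made-up toy task (every detail of the environment is my own assumption): it samples both targets many times and prints their empirical variances, with the TD bootstrap deliberately using the exact value function so that only the single-step noise remains.

```python
import random

# A made-up toy task, purely to illustrate the variance argument: an episode is a
# fixed chain of `horizon` steps, each returning a reward of 1 plus uniform noise.
# With gamma = 1, the true value of the start state is exactly `horizon`.
gamma = 1.0
horizon = 10
noise = 0.5

def true_value(steps_left):
    # Expected remaining return when the mean reward is 1 per step.
    return float(steps_left)

def sample_mc_target():
    # Monte Carlo target: sum of sampled rewards over the whole remaining episode.
    return sum((1.0 + random.uniform(-noise, noise)) * gamma ** k for k in range(horizon))

def sample_td_target():
    # TD(0) target: one sampled reward plus a bootstrap from the value estimate
    # (here taken to be exact, so only the single-step noise contributes variance).
    reward = 1.0 + random.uniform(-noise, noise)
    return reward + gamma * true_value(horizon - 1)

def variance(samples):
    mean = sum(samples) / len(samples)
    return sum((x - mean) ** 2 for x in samples) / len(samples)

mc_samples = [sample_mc_target() for _ in range(10_000)]
td_samples = [sample_td_target() for _ in range(10_000)]
print("MC target variance:", variance(mc_samples))  # roughly horizon x the per-step noise variance
print("TD target variance:", variance(td_samples))  # roughly just the per-step noise variance
```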