The V-function is computed by summing the expected discounted waiting time (EDWT) at the current traffic light and the EDWT from the next possible lights:
where the first quantity is the EDWT at the current traffic light, and the second is the number of cars waiting at the next possible traffic light. This function can be computed as:
where the intra-node Q-function denotes the EDWT at the current light for cars given the decision of the traffic light:
The Q-values can finally be computed as follows:
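For orientation, a minimal sketch of a generic car-based formulation of these quantities, written in assumed notation of our own rather than the paper's intra-/inter-node decomposition: a car state is the traffic light tl, the place p in its queue, and the destination des, and a is the light's decision.

$$
Q([tl,p,des],a) \;=\; \sum_{[tl',p']} P\big([tl',p'] \,\big|\, [tl,p,des],\,a\big)\,\Big(R\big([tl,p],[tl',p']\big) \;+\; \gamma\, V([tl',p',des])\Big)
$$

$$
V([tl,p,des]) \;=\; \sum_{a} P\big(a \,\big|\, [tl,p]\big)\, Q([tl,p,des],a),
\qquad
R\big([tl,p],[tl',p']\big) \;=\;
\begin{cases}
1 & \text{if } [tl',p'] = [tl,p] \text{ (the car stands still)}\\
0 & \text{otherwise}
\end{cases}
$$

The reward of 1 for a car that stands still and 0 otherwise matches the fixed cost function used for adapting the system parameters below.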
We compute these quantities by tracking a car standing at a specific place. Thus, we record the tuple of traffic light, place, and destination, and finally associate it with the place where the car arrives in the next queue. If a car occupies the same (tl, place) state with the same destination for several time-steps, we only count the transition step to the next queue a single time (this is similar to the first-visit sampling method (Singh & Sutton, 1996)).
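To make this bookkeeping concrete, the following is a minimal sketch under assumed names (CarTrace, transition_counts, and the state layout are ours, not the paper's): each (tl, place, des) state a car occupies is recorded only on its first visit and later associated with the place the car reaches in the next queue.

```python
from collections import defaultdict

# Hypothetical count table: (tl, place, des) -> arrival place in the next queue -> count.
transition_counts = defaultdict(lambda: defaultdict(int))

class CarTrace:
    """First-visit style bookkeeping for a single tracked car."""

    def __init__(self):
        self.first_visits = []   # states recorded once, awaiting the next-queue outcome
        self.seen = set()

    def record_step(self, tl, place, des):
        key = (tl, place, des)
        if key not in self.seen:          # further time-steps spent waiting here are not re-counted
            self.seen.add(key)
            self.first_visits.append(key)

    def arrive_in_next_queue(self, new_place):
        for key in self.first_visits:     # associate every first visit with the arrival place
            transition_counts[key][new_place] += 1
        self.first_visits.clear()
        self.seen.clear()
```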
Adapting the system parameters. For adapting the systems, we update the state transition probabilities after each time-step by tracking car movements. Remember that the reward function is fixed (standing still costs 1, otherwise the reward/cost is 0). To compute transition probabilities, we just count the number of transitions from a car-state to all next car-states and divide these by the total number of transitions from that car-state.
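A minimal sketch of this maximum-likelihood update, with assumed function and variable names:

```python
from collections import defaultdict

counts = defaultdict(lambda: defaultdict(int))   # car-state -> next car-state -> observed transitions
totals = defaultdict(int)                        # car-state -> total outgoing transitions

def observe_transition(state, next_state):
    """Called once per tracked car movement after every time-step."""
    counts[state][next_state] += 1
    totals[state] += 1

def transition_probability(state, next_state):
    """Empirical estimate: transitions into next_state divided by all transitions out of state."""
    if totals[state] == 0:
        return 0.0
    return counts[state][next_state] / totals[state]
```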
Co-learning driving policies. A nice feature of our car-based value functions is that they can be immediately used to select a path of traffic lights to the destination address. Note that our city (Figure 1) is like
Manhattan and from one starting place to a destination address there can be multiple shortest paths. The non-adaptable systems generate at each traffic light the options for going from one traffic light to the next one on the way to the destination address and select one of these randomly. Co-learning can be used to select among these shortest paths the path with minimal expected waiting time. For TC-1, we compare the values to determine the best next traffic light for a car crossing an intersection. For TC-2 and TC-3, we can compute the Q-values for going to the next traffic light using global information. We compute the value of going to a next traffic light (given the current light) from the car-based Q-values at that light for the green decision, and choose the traffic light with the lowest value.
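As an illustrative sketch of this selection rule (expected_wait is an assumed helper that aggregates the car-based waiting-time values for the green decision at a candidate light; it is not the paper's API):

```python
def choose_next_light(candidate_lights, des, expected_wait):
    """Among the next traffic lights that lie on a shortest path to des, pick the
    one with the lowest expected waiting time.

    candidate_lights: next lights on shortest paths, as generated by the planner.
    expected_wait(tl, des): assumed helper aggregating the learned waiting-time
    values at light tl for the green decision."""
    return min(candidate_lights, key=lambda tl: expected_wait(tl, des))
```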

4. Experiments

We execute experiments with 10 systems: a random controller for each traffic node, a fixed controller which iterates over all traffic node decisions, a controller which lets the largest queues go first, a controller which tries to let most cars pass the intersection, and our three RL systems: TC-1, TC-2, and TC-3, with or without co-learning. For our experiments we use the city depicted in Figure 1.
Set-up of traffic simulations. The traffic pattern is a fully randomized pattern: a random starting traffic light at the border of the city (20 possibilities) is selected for each newly inserted car, and a random destination address is used for the car (10 possibilities). At each cycle (time-step), 1 to 8 cars are inserted into the city, all with different starting traffic lights, since cars cannot occupy the same initial place at the same traffic light. Therefore it is also possible that the traffic network becomes saturated, in which case cars are refused: we cannot add more cars when all 20 possible starting positions are occupied.
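A small sketch of this insertion step under the stated set-up (20 border starting lights, 10 destination addresses, 1 to 8 new cars per cycle); all names and the saturation bookkeeping are illustrative assumptions:

```python
import random

START_LIGHTS = list(range(20))    # border traffic lights where cars can enter the city
DESTINATIONS = list(range(10))    # possible destination addresses

def insert_cars(cars_per_cycle, occupied):
    """Insert up to cars_per_cycle cars at distinct free starting lights.

    occupied: set of starting lights whose entry place is currently taken.
    Returns the new cars and the number of refused cars (network saturation)."""
    free = [tl for tl in START_LIGHTS if tl not in occupied]
    random.shuffle(free)
    new_cars, refused = [], 0
    for _ in range(cars_per_cycle):
        if not free:              # all possible starting positions occupied: refuse the car
            refused += 1
            continue
        start = free.pop()
        occupied.add(start)
        new_cars.append((start, random.choice(DESTINATIONS)))
    return new_cars, refused
```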
Systems and parameters. The Random system selects the decision at each traffic node randomly, the Fixed system starts with decision 1 for all traffic nodes for one time-step, then selects decision 2 at the next time-step, until it has selected all six decisions and starts again with decision 1. The Longest Q system counts the number of cars which would not have to wait for a red light for each decision and selects the traffic node decision leading to the maximum.
Table 2. Final waiting time results for different systems when adding 1-3 cars per time-step. Results are averages over 10 simulations. A real number was used for the gain variable.
SYSTEM       1 CAR    2 CARS    3 CARS
RANDOM
FIXED
LONGEST Q
MOST CARS
TC-1
TC-1 CO
TC-2
TC-2 CO
TC-3
TC-3 CO
Table 3. Final waiting time results and the number of refused cars for the systems when adding 4 cars per time-step. For the starred systems, randomness is used in the action selection.
SYSTEM       WAITING TIME    REFUSED CARS
RANDOM
FIXED
LONGEST Q
MOST CARS
TC-1
TC-1 CO
TC-2
TC-2 CO
TC-3*
TC-3 CO*
The Most cars system examines how many cars can pass an intersection given some traffic node decision, and selects the decision which is expected to let most cars (0-2) cross an intersection. TC-1, TC-2, and TC-3, with or without co-learning, use one value-function iteration per time-step and no exploration, except for systems which get stuck in dead networks where no cars can drive anymore (which sometimes happens with the TC-3, Longest Q, and Most cars systems), for which we add random actions to the decision policy. We let each system run until 50,000 cars have exited the city and record simulation results after every 2000 cars have left the city. Results are averages over 10 simulations.
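For concreteness, a hedged sketch of the decision rules of the two strongest fixed baselines; cars_not_waiting and cars_crossing stand in for the simulator's queue inspection and are assumptions, not the authors' code:

```python
def longest_q_decision(decisions, cars_not_waiting):
    """Longest Q: pick the traffic node decision under which the most cars
    would not have to wait for a red light (cars_not_waiting is an assumed helper)."""
    return max(decisions, key=cars_not_waiting)

def most_cars_decision(decisions, cars_crossing):
    """Most cars: pick the decision expected to let the most cars actually cross
    the intersection, i.e. taking full next road-lanes into account
    (cars_crossing is an assumed helper returning that expected number)."""
    return max(decisions, key=cars_crossing)
```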
Experimental results. Table 2 shows the final (after 50,000 steps) average waiting time results for the last 2000 cars exiting the city when adding 1 to 3 cars per time-step. When adding a single car at each cycle, TC-3 with co-learning works best, closely followed by the other co-learning RL systems. The random system performs worst, with a waiting time that is more than 23 times longer than that of the best algorithms. When adding 2 cars the results are quite similar: TC-3 and TC-2 with co-learning work best, followed by TC-1 with co-learning. When adding 3 cars, TC-3 with co-learning works best, followed by the other RL systems; Longest Q and Most cars both perform worse than TC-3 with co-learning. The random and fixed systems again come last and even show saturating behavior (see Figure 2(A)). Note that although the differences are not so large, TC-3 with co-learning always significantly (t-test) outperforms all fixed systems.
Adding four cars. Table 3 shows the results when adding 4 cars. Here the network starts to saturate for all algorithms. Therefore not only the average waiting time is important, but also the number of refused cars. For Longest Q and TC-3, we had to add noise in the action selection, since otherwise they got stuck in traffic situations where no cars could move anymore and no cars were able to enter the city. Such "dead" network states result from deterministic policies which set lights to green for cars which cannot cross the intersection, since the next road-lane is full and the next (or the one after the next) traffic light is set to red.
TC-2 with co-learning works best, followed by TC-1 with co-learning. They have the lowest waiting times and refuse the lowest number of cars. Note that refused cars would make adjoining traffic networks more crowded. Apparently, optimizing driving policies in busy traffic situations is very useful here. TC-3 refuses many cars during the initial learning phase, but finally obtains the best performance of the non-co-learning systems. The Most cars algorithm results in fluctuating performance (waiting times). It does not refuse so many cars, though, which is different from the Longest Q system, which refuses by far the most cars.
Saturation behavior for adding more cars. Figure 2(A) shows the total number of refused cars during a run when we increase traffic loads, and Figure 2(B) shows the average final waiting times. When adding 5-8 cars, TC-2 with co-learning refuses the least number of cars, followed by TC-1 with co-learning. The Longest Q system performs worst. The fixed system also refuses many cars, and this explains why its average waiting time is shortest for highly crowded traffic. The random system works quite well for very crowded roads; it seems that for such cases random decisions work reasonably well. The Most cars algorithm performs quite well, but suffers from fluctuating performance levels. All systems can use co-learning of driving policies to minimize the number of refused cars.
Figure 2. A comparison between the different adaptable and fixed systems on more or less crowded traffic patterns. (A): The average number of refused cars during an entire run. (B): The average waiting time of the last 2000 cars exiting the city. Results are averages over 10 simulations.
The reason that TC-2 with co-learning works best may be that it continuously adapts its policy, thereby making it non-stationary. Therefore it can react when particular decisions do not make sense, like setting the light to green while the first car cannot go to an overcrowded next road-lane. TC-3 is not able to continuously change its policy, and therefore it sometimes ends up in dead networks (when used without randomness). The reason is that only intra-node Q-values are adapted when cars remain waiting, and this is sometimes not sufficient to change the outcome of the voting process, since inter-node Q-values may have a large impact on the decision. Furthermore, communicating the number of cars at the next light is less useful if that number is almost always 20 and the first car cannot drive. The Longest Q system suffers a lot from deadlock situations (no car can drive given some decisions of the traffic lights), but it is strange that even with randomness it cannot overcome its problems (with more randomness it performs better, but it still performs worse than the random system).

5. Discussion

For low traffic loads, constructing good (near-optimal) fixed controllers is not difficult, since all traffic nodes can operate locally. Therefore the gain in using RL for learning traffic light controllers is quite small, although learning driving policies is still useful. When we increase traffic load, the amount of interaction between traffic nodes increases, and the locally well performing fixed systems do not work well anymore. Furthermore, the dynamics of crowded traffic patterns are complex so that it is hard to design better controllers. Here, using RL systems for traffic light control is clearly beneficial. Co-learning driving policies is also very useful, since it helps to direct traffic flow in the city.
Co-learning. Learning driving policies at the same time as learning traffic light controllers shows interesting co-learning phenomena: traffic nodes which are quite busy, and thus have a hard task minimizing overall waiting time, are relieved by the intelligent driving policies circumventing such intersections. Thereby the cars reactively spread through the city and help to minimize the shared value functions.
Communication. The use of communicated information can help the RL systems to optimize traffic light controllers. Since traffic nodes are highly interdependent when regulating highly crowded traffic, we could also design different communication schemes in which traffic node decisions are communicated. We are currently studying methods for efficiently evaluating global decisions in this way.
Related work. Thorpe and Anderson (1996) used direct RL to learn traffic controllers on a simulated traffic control problem consisting of a network of 4 traffic light controllers. They modelled average speed, queueing, and acceleration/deceleration of cars. The controller was trained on a single intersection, after which it was copied to the other intersections. Results showed that, using their best state representation (which indicates which segments of the roads were occupied by cars), RL learned to outperform algorithms which used fixed waiting times or allowed the largest queue to go first. A big difference between their approach and ours is that their traffic node policy selects decisions based on a combined representation of the local traffic situation. To deal with the explosive number of states, they abstract away from a lot of information. Instead, we use car-based value functions and a voting scheme for selecting actions. This has the advantage that (local) optimal controllers may be obtained if the value functions are accurate, while we still do not suffer from huge state spaces. Furthermore, the car-based value functions can be used by the driving policies.

  1. For particular systems, we have to take communicated state information into account as well.
  2. Due to particular impossible paths, generated cars cannot use all 200 combinations.