
Deep reinforcement learning meets graph neural networks: Exploring a routing optimization use case


Paul Almasan$^{a}$, José Suárez-Varela$^{a}$, Krzysztof Rusek$^{b,a}$, Pere Barlet-Ros$^{a}$, Albert Cabellos-Aparicio$^{a}$
$^{a}$ Barcelona Neural Networking Center, Universitat Politècnica de Catalunya, Barcelona, Spain
b ^("b "){ }^{\text {b }} 波兰克拉科夫 AGH 科技大学电信研究所
b ^("b "){ }^{\text {b }} Institute of Telecommunications, AGH University of Science and Technology, Krakow, Poland

ARTICLE INFO

Keywords:
Graph neural networks
Deep reinforcement learning
Routing
Optimization

Abstract

Deep Reinforcement Learning (DRL) has shown a dramatic improvement in decision-making and automated control problems. Consequently, DRL represents a promising technique to efficiently solve many relevant optimization problems (e.g., routing) in self-driving networks. However, existing DRL-based solutions applied to networking fail to generalize, which means that they are not able to operate properly when applied to network topologies not observed during training. This lack of generalization capability significantly hinders the deployment of DRL technologies in production networks. This is because state-of-the-art DRL-based networking solutions use standard neural networks (e.g., fully connected, convolutional), which are not suited to learn from information structured as graphs.

In this paper, we integrate Graph Neural Networks (GNN) into DRL agents and we design a problem specific action space to enable generalization. GNNs are Deep Learning models inherently designed to generalize over graphs of different sizes and structures. This allows the proposed GNN-based DRL agent to learn and generalize over arbitrary network topologies. We test our DRL+GNN agent in a routing optimization use case in optical networks and evaluate it on 180 and 232 unseen synthetic and real-world network topologies respectively. The results show that the DRL+GNN agent is able to outperform state-of-the-art solutions in topologies never seen during training.

1. Introduction

In the last years, industrial advances (e.g., Industry 4.0, IoT) and changes in social behavior created a proliferation of modern network applications (e.g., Vehicular Networks, AR/VR, Real-Time Communications), imposing new requirements on backbone networks (e.g., high throughput and low latency). Consequently, network operators need to efficiently manage the network resources, ensuring the customer’s Quality of Service and fulfilling the Service Level Agreements. This is typically done using expert knowledge or solvers leveraging Integer Linear Programming (ILP) or Constraint Programming (CP). However, real-world production networks have in the order of hundreds of nodes and solvers based on ILP or CP would take a large amount of time to solve network optimization problems [1,2]. In addition, heuristic based solutions are far from being optimal.
Deep Reinforcement Learning (DRL) has shown significant improvements in sequential decision-making and automated control problems [3,4]. As a result, the network community is already investigating DRL as a key technology for network optimization (e.g., routing) with the goal of enabling self-driving networks [5-8]. However, existing DRL-based solutions still fail to generalize when applied to different network scenarios [9,10]. In this context, generalization refers to the ability of the DRL agent to adapt to new network scenarios not seen during training (e.g., network topologies, configurations).
We argue that generalization is an essential property for the successful adoption of DRL technologies in production networks. Without generalization, DRL solutions should be trained in the same network where they are deployed, which is not possible or affordable in general. To train a DRL agent is a very costly and lengthy process. It often requires significant computing power and instrumentation of the network to observe its performance (e.g., delay, jitter). Additionally, decisions made by a DRL agent during training can lead to degraded performance or even to service disruption. Thus, training a DRL agent in the customer’s network may be unfeasible.
With generalization, a DRL agent can be trained with multiple, representative network topologies and configurations. Afterwards, it can be applied to other topologies and configurations, as long as they share some common properties. Such a "universal" model can be trained in a laboratory and later on be incorporated in a product or a network device (e.g., router, load balancer). The resulting solution would be ready to be deployed to a production network without requiring any further training or instrumentation in the customer network.$^{1}$
Unfortunately, existing DRL proposals for networking were designed to operate in the same network topology seen during training [9,11, 12], thereby limiting their potential deployment on production networks. The main reason behind this strong limitation is that computer networks are fundamentally represented as graphs. For instance, the network topology and routing policy are typically represented as such. However, state-of-the-art proposals [11,13-15] use traditional neural network (NN) architectures (e.g., fully connected, convolutional) that are not well suited to model graph-structured information [16].
In this paper, we integrate Graph Neural Networks (GNN) [17] into DRL agents to solve network optimization problems. Particularly, our architecture is intended to solve routing optimization in optical networks and to generalize over never-seen arbitrary topologies. The GNN integrated in our DRL agent is inspired by Message-passing Neural Networks (MPNN), which were successfully applied to solve a relevant chemistry-related problem [18]. In our case, the GNN was specifically designed to capture meaningful information about the relations between the links and the traffic flowing through the network topologies.
The evaluation results show that our agent achieves a strong generalization capability compared to state-of-the-art DRL (SoA DRL) algorithms [15]. Additionally, to further test the generalization capability of the proposed DRL-based architecture, we evaluated it on a set of 232 different real-world network topologies. The results show that the proposed DRL+GNN architecture is able to achieve outstanding performance over networks never seen during training. Finally, we explore the generalization limitations of our architecture and discuss its scalability properties.
Overall, our DRL+GNN architecture for network optimization has the following features:
• Generality: It can work effectively in network topologies and scenarios never seen during training.
• Deployability: It can be deployed to production networks without requiring training nor instrumentation in the customer network.
• Low overhead: Once trained, the DRL agent can make routing decisions in only one step ($\approx$ ms), while its cost scales linearly with the network size.
• Commercialization: Network vendors can easily embed it in network devices or products, and successfully operate "arbitrary" networks.
We believe the combination of these features can enable the development of a new generation of networking solutions based on DRL that are more cost-effective than current approaches based on heuristics or linear optimization. All the topologies and scripts used in the experiments, as well as the source code of our DRL+GNN agent are publicly available [19].

2. Background

The solution proposed in this paper combines two machine learning mechanisms. First, we use a GNN to model computer network scenarios. GNNs are neural network architectures specifically designed to generalize over graph-structured data [16]. In addition, they offer near real-time operation in the scale of milliseconds (see Section 6.2). Second, we use Deep Reinforcement Learning to build an agent that learns how to efficiently operate networks following a particular optimization goal. DRL applies the knowledge obtained in past optimizations to later decisions, without the necessity to run computationally intensive algorithms.

2.1. Graph Neural Networks

Graph Neural Networks are a novel family of neural networks designed to operate over graphs. They were introduced in [17] and numerous variants have been developed since [20,21]. In their basic form, they consist of associating some initial states to the different elements of an input graph, and combining them considering how these elements are connected in the graph. An iterative algorithm updates the elements’ state and uses the resulting states to produce an output. The particularities of the problem to solve will determine which GNN variant is more suitable, depending on, for instance, the nature of the graph elements (i.e., nodes and edges) involved.
Message Passing Neural Networks (MPNN) [18] are a well-known type of GNNs that apply an iterative message-passing algorithm to propagate information between the nodes of the graph. In a message-passing step, each node $k$ receives messages from all the nodes in its neighborhood, denoted by $N(k)$. Messages are generated by applying a message function $m(\cdot)$ to the hidden states of node pairs in the graph. Then, they are combined by an aggregation function, for instance, a sum (Eq. (1)). Finally, an update function $u(\cdot)$ is used to compute a new hidden state for every node (Eq. (2)).
$$M_{k}^{t+1}=\sum_{i \in N(k)} m\left(h_{k}^{t}, h_{i}^{t}\right)\tag{1}$$
$$h_{k}^{t+1}=u\left(h_{k}^{t}, M_{k}^{t+1}\right)\tag{2}$$
where the functions $m(\cdot)$ and $u(\cdot)$ can be learned by neural networks. After a certain number of iterations, the final node states are used by a readout function $r(\cdot)$ to produce an output for the given task. This function can also be implemented by a neural network and is typically tasked to predict properties of individual nodes (e.g., the node's class) or global properties of the graph.
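As an illustration of Eqs. (1)-(2) and the readout step, the following minimal Python sketch (our own, not the authors' implementation) organizes the message passing as plain functions; `m`, `u` and `r` stand for the learnable message, update and readout functions and are assumed to be supplied as callables.

```python
def message_passing(h, neighbors, m, u, r, T):
    """h: dict node -> state vector; neighbors: dict node -> list of adjacent nodes."""
    for _ in range(T):
        new_h = {}
        for k in h:
            # Eq. (1): sum the messages received from the neighborhood N(k)
            M_k = sum(m(h[k], h[i]) for i in neighbors[k])
            # Eq. (2): update the hidden state of node k
            new_h[k] = u(h[k], M_k)
        h = new_h
    # Readout r() maps the final node states to the task output
    return r(list(h.values()))
```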
GNNs have been able to achieve relevant performance results in multiple domains where data is typically structured as a graph [18,22]. Since computer networks are fundamentally represented as graphs, GNNs are by design well suited to model them and offer unique advantages for network modeling compared to traditional neural network architectures (e.g., fully connected NN, Convolutional NN, etc.).

2.2. Deep Reinforcement Learning

DRL algorithms aim at learning a long-term strategy that maximizes an objective function in an optimization problem. DRL agents start from a tabula rasa state and learn the optimal strategy through an iterative process that explores the state and action spaces, denoted by a set of states $\mathcal{S}$ and a set of actions $\mathcal{A}$. Given a state $s \in \mathcal{S}$, the agent performs an action $a \in \mathcal{A}$ that produces a transition to a new state $s' \in \mathcal{S}$ and provides the agent with a reward $r$. The objective is then to find a strategy that maximizes the cumulative reward by the end of an episode. The definition of the end of an episode depends on the optimization problem to address.
Q-learning [23] is an RL algorithm whose goal is to make an agent learn a policy $\pi: \mathcal{S} \rightarrow \mathcal{A}$. The algorithm creates a table (a.k.a., q-table) with all the possible combinations of states and actions. At the beginning of the training, the table is initialized (e.g., with zeros or random values) and, during training, the agent updates these values according to the rewards obtained after selecting an action. These values, called q-values, represent the expected cumulative reward after applying action $a$ from state $s$, assuming that the agent follows the current policy $\pi$ during the rest of the episode. During training, q-values are updated using the Bellman equation (see Eq. (3)), where $Q(s_t, a_t)$ is the q-value function at time-step $t$, $\alpha$ is the learning rate, $r(s_t, a_t)$ is the reward obtained from selecting action $a_t$ from state $s_t$, and $\gamma \in [0,1]$ is the discount factor.
$$Q\left(s_{t}, a_{t}\right)=Q\left(s_{t}, a_{t}\right)+\alpha\left(r\left(s_{t}, a_{t}\right)+\gamma \max_{a^{\prime}} Q\left(s_{t}^{\prime}, a^{\prime}\right)-Q\left(s_{t}, a_{t}\right)\right)\tag{3}$$
Deep Q-Network (DQN) [24] is a more advanced algorithm based on Q-learning that uses a Deep Neural Network (DNN) to approximate the q-value function. As the q-table size becomes larger, Q-learning faces difficulties to learn a policy from high dimensional state and action spaces. To overcome this problem, they proposed to use a DNN as a q-value function estimator, relying on the generalization capabilities of DNNs to estimate the q-values of states and actions unseen in advance. For this reason, a NN well suited to understand and generalize over the input data of the DRL agent is crucial for the agent to perform well when facing states (or environments) never seen before. Additionally, DQN uses an experience replay buffer to store past sequential experiences (i.e., it stores tuples of $\{s, a, r, s'\}$).
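For reference, a minimal sketch of the tabular update in Eq. (3) is shown below; DQN replaces the table `Q` with a neural network (in this paper, a GNN). The dictionary-based layout is our assumption, not the paper's code.

```python
from collections import defaultdict

def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.95):
    """Q: defaultdict(float) keyed by (state, action) pairs; applies Eq. (3) in place."""
    # Target: immediate reward plus the discounted value of the best next action
    best_next = max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])

# Usage sketch: Q = defaultdict(float); q_learning_update(Q, s, a, r, s_next, actions)
```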

3. Network optimization scenario

In this paper, we explore the potential of a GNN-based DRL agent to address the routing problem in Optical Transport Networks (OTN). Particularly, we consider a network scenario based on Software-Defined Networking, where the DRL agent (located in the control plane) has a global view of the current network state, and has to make routing decisions on every traffic demand as it arrives. We consider a traffic demand as the volume of traffic sent from a source to a destination node. This is a relevant optimization scenario that has been studied in the last decades in the optical networking community, where many solutions have been proposed [11,15,25].
In our OTN scenario, the DRL agent makes routing decisions at the electrical domain, over a logical topology where nodes represent Reconfigurable Optical Add-Drop Multiplexers (ROADM) and edges are predefined lightpaths connecting them (see Fig. 1). The DRL agent receives traffic demands with different bandwidth requirements defined by the tuple {src, dst, bandwidth}, and it has to select an end-to-end path for every demand. Particularly, end-to-end paths are defined as sequences of lightpaths connecting the source and destination of a demand. Since the agent operates at the electrical domain, traffic demands are defined as requests of Optical Data Units (ODUk), whose bandwidth requirements are defined in the ITU-T Recommendation G.709 [26]. The ODUk signals are then multiplexed into Optical Transport Units (OTUk), which are data frames including Forward Error Correction. Eventually, OTUk frames are mapped to different optical channels within the lightpaths of the topology.
In this scenario, the routing problem is defined as finding the optimal routing policy for each incoming source-destination traffic demand. The learning process is guided by an objective function that aims to maximize the traffic volume allocated in the network in the long-term. We consider that a demand is properly allocated if there is enough available capacity in all the lightpaths forming the end-to-end path selected. Note that lightpaths are the edges in the logical topology where the agent operates. The demands do not expire, occupying the lightpaths until the end of a DRL episode. This implies a challenging task for the agent, since it has not only to identify critical resources on networks (e.g., potential bottlenecks), but also to deal with the uncertainty in the generation of future traffic demands. The following constraints summarize the traffic demand routing problem in the OTN scenario:
• The agent must make sequential routing decisions for every incoming traffic demand.
• Traffic demands cannot be split over multiple paths.
• Previous traffic demands cannot be rerouted; they occupy the links' capacities until the end of the episode.
Fig. 1. Schematic representation of the DRL agent in the OTN routing scenario.
The optimal solution to the OTN optimization problem can be found by solving its Markov Decision Process (MDP) [27]. To do this, we can use techniques such as Dynamic Programming, which consist of an iterative process over all the MDP's states until convergence. The MDP for the traffic demand allocation problem consists of all the possible network topology states and the transition probabilities between states. Notice that in our scenario we have uniform transition probabilities from one state to the next. One limitation of solving MDPs optimally is that it becomes infeasible for large and complex optimization problems. As the problem size grows, so does the MDP's state space, where the space complexity (in number of states) is $S \approx O(N^{E})$, with $N$ being the number of different capacities a link can have and $E$ the number of links. Therefore, to solve the MDP the algorithm will spend more time iterating over all the MDP's states.
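For concreteness, a minimal value-iteration sketch over a finite MDP is shown below (illustrative only; as argued above, iterating over all states is infeasible at the scale of real topologies). The `transition` and `reward` callbacks are hypothetical descriptions of the MDP, not part of the paper's code.

```python
def value_iteration(states, actions, transition, reward, gamma=0.95, tol=1e-6):
    """transition(s, a) -> list of (prob, next_state); reward(s, a) -> float."""
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            # Bellman optimality backup over all actions
            best = max(reward(s, a) + gamma * sum(p * V[s2] for p, s2 in transition(s, a))
                       for a in actions)
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < tol:  # stop when the value function has converged
            return V
```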

4. GNN-based DRL agent design

In this section, we describe the DRL+GNN architecture proposed in this paper. On one side, we have the GNN-based DRL agent, which defines the actions to apply on the network topology. These actions consist of allocating the demands on one of the candidate paths. Our DRL agent implements the DQN algorithm [24], where the q-value function is modeled by a GNN. On the other side, we have an environment that defines the optimization problem to solve. This environment stores the network topology, together with the link features. In addition, the environment is responsible for generating the reward once an action is performed, which indicates to the agent whether that action was good or not.
The learning process is based on an iterative process, where at each time step, the agent receives a graph-structured network state observation from the environment. Then, the GNN constructs a graph representation where the links of the topology are the graph entities. In this representation, the link hidden states are initialized considering the input link-level features and the routing action to evaluate (see more details in Sections 4.1-4.3). With this representation, an iterative message passing algorithm runs between the links' hidden states according to the graph structure. The output of this algorithm (i.e., the new link hidden states) is aggregated into a global hidden state that encodes topology information, and is then processed by a DNN. This process makes the GNN topology invariant, because the global hidden state length is pre-defined and will always be the same for different topology sizes. At the end of the message passing phase, the GNN outputs a q-value estimate. This q-value is evaluated over a limited set of actions, and finally the DRL agent selects the action with the highest q-value.

4.1. Environment

Fig. 2. Action representation in the link hidden states.

The network state is defined by the topology links' features, which include the link capacity and the link betweenness. The former indicates the amount of capacity available on the link. The latter is a measure of centrality inherited from graph theory that indicates how many paths may potentially traverse the link. From the experimental results we observed that this feature helps reduce the grid search of the hyperparameter tuning for the DRL agent. This is because the betweenness helps the agent converge faster to a good policy. In particular, we compute the link betweenness in the following way: for each pair of nodes in the topology, we compute $k$ candidate paths (e.g., the $k$ shortest paths), and we maintain a per-link counter that indicates how many paths pass through the link. Thus, the betweenness on each link is the number of end-to-end paths crossing the link divided by the total number of paths.
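As an illustration (not the authors' released code), the per-link betweenness described above can be derived from the k shortest paths of every ordered node pair; the sketch below assumes an undirected networkx graph and uses `nx.shortest_simple_paths` to enumerate candidate paths by number of hops.

```python
from itertools import islice, permutations
import networkx as nx

def k_shortest_paths(G, src, dst, k=4):
    # k shortest simple paths between src and dst (by number of hops)
    return list(islice(nx.shortest_simple_paths(G, src, dst), k))

def link_betweenness(G, k=4):
    counter = {frozenset(e): 0 for e in G.edges()}
    total_paths = 0
    for src, dst in permutations(G.nodes(), 2):
        for path in k_shortest_paths(G, src, dst, k):
            total_paths += 1
            for u, v in zip(path[:-1], path[1:]):
                counter[frozenset((u, v))] += 1  # this candidate path crosses link (u, v)
    # Betweenness: fraction of candidate end-to-end paths crossing each link
    return {tuple(e): c / total_paths for e, c in counter.items()}
```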

4.2. Action space

In this section we describe how the routing actions are represented in the DRL+GNN agent. Note that the number of possible routing combinations for each source-destination node pair typically results in a high dimensional action space in large-scale real-world networks. This makes the routing problem complex for the DRL agent, since it should estimate the q-values for all the possible actions to apply (i.e., routing configurations). To overcome this problem, the action space must be carefully designed to reduce the dimensionality. In addition, to enable generalization to other topologies, the action space should be equivalent across topologies. In other words, if the actions in the training topology are represented by shortest paths, in the evaluation topology they should also be shortest paths. If the action space were different (e.g., multiple paths between a source-destination node pair), the agent would have problems learning and would not generalize well. To leverage the generalization capability of GNNs, we introduce the action into the agent in the form of a graph. This makes the action representation invariant to node and edge permutation, which means that, once the GNN is successfully trained, it is able to understand actions over arbitrary graph structures (i.e., over different network states and topologies).
Considering the above, we limit the action set to $k$ candidate paths for each source-destination node pair. To maintain a good trade-off between the flexibility to route traffic and the cost of evaluating all the possible actions, we selected the set of the $k=4$ shortest paths (by number of hops) for each source-destination node pair. This follows a criterion originally proposed in [15]. Note that the action set differs depending on the source and destination nodes of the traffic demand to be routed.
To represent the action, we introduce it within the network state. Particularly, we consider an additional link-level feature, which is the bandwidth allocated over the link after applying the routing action. This value corresponds to the bandwidth demand of the current traffic request to be allocated. Likewise, the links that are not included in the path selected by the action will have this feature set to zero.
Table 1
Input features of the link hidden states. $N$ corresponds to the size of the hidden state vector.

| Notation | Description |
| --- | --- |
| $x_{1}$ | Link available capacity |
| $x_{2}$ | Link betweenness |
| $x_{3}$ | Action vector (bandwidth allocated) |
| $x_{4}-x_{N}$ | Zero padding |
Since our OTN environment has a limited number of traffic requests with various discrete bandwidth demands, we represent the bandwidth allocated with an $N$-element one-hot encoding vector, where $N$ is the vector length.
Fig. 2 illustrates the representation of the action in the hidden state of the links in a simple network scenario. A traffic request from node 1 to node 5, with a traffic demand of 8 bandwidth units, is allocated over the path formed by the nodes {1, 2, 3, 5}. To summarize, Table 1 provides a description of the features included in the links' hidden states. These values represent both the network state and the action, which is the input needed to model the q-value function $Q(s, a)$.
The size of the hidden states is typically larger than the number of features they contain. This enables each link to store its own information (i.e., its own initial features) plus the aggregated information coming from all the link's neighbors (see Section 4.3). If the hidden state size were equal to the number of link features, the links would not have space to store information about the neighboring links without losing information, which results in a poor graph embedding after the readout function. On the contrary, if the state size is very large, it can lead to a large GNN model, which can overfit the data. A common approach is to set the state size larger than the number of features and to fill the rest of the vector with zeros.
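The following sketch shows one way the initial link hidden state of Table 1 could be laid out (the exact encoding, e.g., whether the action feature is a scalar or a one-hot block, is our assumption); the hidden size of 27 matches the value reported later in Section 5.1.

```python
import numpy as np

HIDDEN_SIZE = 27  # hidden state size used in the evaluation (Section 5.1)

def initial_link_state(capacity, betweenness, bw_allocated):
    """bw_allocated: bandwidth placed on this link by the action, 0 if the link is unused."""
    h = np.zeros(HIDDEN_SIZE, dtype=np.float32)
    h[0] = capacity      # x1: link available capacity
    h[1] = betweenness   # x2: link betweenness
    h[2] = bw_allocated  # x3: action feature (bandwidth allocated on the link)
    # positions x4..xN remain as zero padding, leaving room for aggregated neighbor info
    return h
```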

4.3. GNN architecture

The GNN model is based on the Message Passing Neural Network [18] model. In our case, we consider the link entity and perform the message passing process between all links. We choose link entities, instead of node entities, because the link features are what define the OTN routing optimization problem. Node entities could be added when addressing an optimization problem that needs to incorporate node-level features (e.g., I/O buffer size, scheduling algorithm). Algorithm 1 shows a formal description of the message passing process, where the algorithm receives as input the links' features ($x_l$) and outputs a q-value ($q$).
Fig. 3. Message passing architecture.
Algorithm 1 Message Passing
    Input: x_l
    Output: h_l^T, q
    for each l ∈ L do
        h_l^0 ← [x_l, 0, ..., 0]
    for t = 1 to T do
        for each l ∈ L do
            M_l^{t+1} = Σ_{i ∈ N(l)} m(h_l^t, h_i^t)
            h_l^{t+1} = u(h_l^t, M_l^{t+1})
    rdt ← Σ_{l ∈ L} h_l
    q ← R(rdt)
The algorithm performs $T$ message-passing steps; a graphical representation can be seen in Fig. 3. The algorithm iterates over all the links of the network topology. For each link, its features are combined with those of the neighboring links using a fully-connected NN, corresponding to M in Fig. 3. The outputs of these operations are called messages according to the GNN notation. Then, the messages computed for each link with its neighbors are aggregated using an element-wise sum (line 5 in Algorithm 1). Afterwards, a Recurrent NN (RNN) is used to update the link hidden states $h_l$ with the new aggregated information (line 6 in Algorithm 1). At the end of the message passing phase, the resulting link states are aggregated using an element-wise sum (line 7 in Algorithm 1). The result is passed through a fully-connected DNN, which models the readout function of the GNN. The output of this latter function is the estimated q-value for the input state and action.
The role of the RNN is to learn how the link states change along the message passing phase. As the link information is spread through the graph, each hidden state stores information from links that are farther and farther apart; therefore, a notion of time appears. RNNs are an NN architecture tailored to capture sequential behavior (e.g., text, video, time series). In addition, some RNN architectures (e.g., GRU) are designed to process long sequences (e.g., long text sentences in NLP). Specifically, they internally contain gates designed to mitigate vanishing gradients, a common problem with long sequences [28]. This makes RNNs suitable to learn how the links' states evolve during the message passing phase, even for large $T$.
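To make the architecture more concrete, the TensorFlow sketch below (an illustration under our own assumptions, not the authors' released code) implements one message-passing round of Algorithm 1 with a learnable message function and a GRU update, plus the sum-based readout; names such as `message_nn`, `update_cell` and `neighbor_pairs` are hypothetical.

```python
import tensorflow as tf

hidden_size = 27
message_nn = tf.keras.Sequential([tf.keras.layers.Dense(hidden_size, activation="selu")])
update_cell = tf.keras.layers.GRUCell(hidden_size)
readout_nn = tf.keras.Sequential([tf.keras.layers.Dense(64, activation="relu"),
                                  tf.keras.layers.Dense(1)])

def message_passing_step(link_states, neighbor_pairs):
    """link_states: [num_links, hidden_size]; neighbor_pairs: int tensor [num_pairs, 2]
    listing (link, neighboring link) index pairs for every pair of adjacent links."""
    h_main = tf.gather(link_states, neighbor_pairs[:, 0])
    h_neigh = tf.gather(link_states, neighbor_pairs[:, 1])
    messages = message_nn(tf.concat([h_main, h_neigh], axis=1))
    # Element-wise sum of the messages received by each link (aggregation in Algorithm 1)
    aggregated = tf.math.unsorted_segment_sum(messages, neighbor_pairs[:, 0],
                                              num_segments=tf.shape(link_states)[0])
    # GRU update of every link hidden state with the aggregated information
    new_states, _ = update_cell(aggregated, [link_states])
    return new_states

def q_value(link_states):
    # Readout: element-wise sum over links, then a fully-connected DNN
    return readout_nn(tf.reduce_sum(link_states, axis=0, keepdims=True))
```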

4.4. DRL agent operation

The DRL agent operates by interacting with the environment. In Algorithm 2 we can observe a pseudocode describing the DRL agent operation. At the beginning, we initialize the environment env by initializing all the link features. At the same time, the environment generates a traffic demand to be allocated, defined by the tuple {src, dst, bw}, and an environment state $s$. We also initialize the cumulative reward to zero, define the action set size and create the experience replay buffer (agt.mem). Afterwards, we execute a while loop (lines 3-16) that finishes when there is some demand that cannot be allocated in the network topology. For each of the $k=4$ shortest paths, we allocate the demand along all the links forming the path and compute a q-value (lines 7-9). Once we have the q-value for each state-action pair, the next action $a$ to apply is selected using an $\epsilon$-greedy exploration strategy (line 10) [24]. The action is then applied to the environment, leading to a new state $s'$, a reward $r$ and a flag Done indicating if there is some link without enough capacity to support the demand. Additionally, the environment returns a new traffic demand tuple {src', dst', bw'}. The information about the state transition is stored in the experience replay buffer (line 13). This information will be used later on to train the GNN in the agt.replay() call (line 15), which is executed every $M$ training iterations.
Algorithm 2 DRL Agent operation
    s, src, dst, bw ← env.init_env()
    reward ← 0, k ← 4, agt.mem ← {}, Done ← False
    while not Done do
        k_q_values ← {}
        k_shortest_paths ← compute_k_paths(k, src, dst)
        for i in 0, ..., k do
            p' ← get_path(i, k_shortest_paths)
            s' ← env.alloc_demand(s, p', src, dst, bw)
            k_q_values[i] ← compute_q_value(s', p')
        q_value ← epsilon_greedy(k_q_values, ε)
        a ← get_action(q_value, k_shortest_paths, s)
        r, Done, s', src', dst', bw' ← env.step(s, a)
        agt.rmb(s, src, dst, bw, a, r, s', src', dst', bw')
        reward ← reward + r
        if training_steps % M = 0: agt.replay()
        src ← src'; dst ← dst'; bw ← bw'; s ← s'

5. Experimental results

In this section we evaluate our GNN-based DRL agent to optimize the routing configuration in the OTN scenario described in Section 3. Particularly, the experiments in this section are focused on evaluating the performance and generalization capabilities of the proposed DRL+GNN agent. Afterwards, in Section 6, we analyze the scalability properties of our solution and discuss other relevant aspects related to the deployment on production networks.

5.1. Evaluation setup

We implemented the DRL+GNN solution described in Section 4 with Tensorflow [29] and evaluated it on an OTN network simulator implemented using the OpenAI Gym framework [30]. The source code, together with all the training and evaluation results are publicly available [19].
In the OTN simulator, we consider three traffic demand types (ODU2, ODU3, and ODU4), whose bandwidth requirements are expressed in terms of multiples of ODU0 signals (i.e., 8, 32, and 64 ODU0 bandwidth units respectively) [26]. When the DRL agent properly allocates a demand, it receives an immediate reward equal to the bandwidth of that traffic demand; otherwise, the reward is 0. We consider that a demand is successfully allocated if all the links in the path selected by the DRL agent have enough available capacity to carry such demand. Likewise, episodes end when a traffic demand cannot be correctly allocated. Traffic demands are generated by uniformly selecting a source-destination node pair and a traffic demand type (ODUk). This makes the problem even more difficult for the DRL agent, since the uniform traffic distribution hinders the exploitation of prediction systems to anticipate demands that are difficult to allocate. In other words, all traffic demands are equally probable to appear in the future, making it more difficult for the DRL agent to estimate the expected future rewards.
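A minimal sketch of this reward logic is shown below (names such as `capacity` and `path_links` are hypothetical and not taken from the released simulator).

```python
def allocate_and_reward(capacity, path_links, demand_bw):
    """capacity: dict link -> free ODU0 units on that link; returns (reward, done)."""
    if all(capacity[l] >= demand_bw for l in path_links):
        for l in path_links:
            capacity[l] -= demand_bw  # the demand occupies the links until the episode ends
        return demand_bw, False       # reward = bandwidth of the allocated demand
    return 0, True                    # blocked demand: zero reward and the episode ends
```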
Initial experiments were carried out to choose an appropriate gradient-based optimization algorithm and to find the hyperparameter values for the DRL+GNN agent. For the GNN model, we defined the links' hidden states $h_l$ as 27-element vectors (filled with the features described in Table 1). Note that the size of the hidden states is related to the amount of information they may potentially encode. Larger network topologies and complex network optimization scenarios might need larger sizes for the hidden state vectors. In every forward propagation of the GNN we execute $T=7$ message passing steps using batches of 32 samples. The optimizer used is the Stochastic Gradient Descent [31] with a learning rate of $10^{-4}$ and a momentum of 0.9. We start the $\epsilon$-greedy exploration strategy with $\epsilon=1.0$ and maintain this value during 70 training iterations. Afterwards, $\epsilon$ decays exponentially every episode. The experience buffer stores 4000 samples and is implemented as a FIFO queue (first in, first out). We applied $\ell_2$ regularization and dropout to the readout function, with a coefficient of 0.1 in both cases. The discount factor $\gamma$ was set to 0.95.
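For quick reference, the hyperparameters reported above can be summarized as follows (the dictionary layout itself is ours, not taken from the released code).

```python
HPARAMS = {
    "link_hidden_size": 27,       # size of the link hidden state h_l
    "message_passing_steps": 7,   # T
    "batch_size": 32,
    "optimizer": "SGD",           # Stochastic Gradient Descent with momentum
    "learning_rate": 1e-4,
    "momentum": 0.9,
    "epsilon_start": 1.0,         # kept during the first 70 training iterations
    "epsilon_decay": "exponential, per episode",
    "replay_buffer_size": 4000,   # FIFO experience buffer
    "l2_regularization": 0.1,     # applied to the readout function
    "dropout": 0.1,               # applied to the readout function
    "discount_factor": 0.95,      # gamma
}
```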

5.2. Methodology

We divided the evaluation of our DRL+GNN agent into two sets of experiments. In the first set, we focused on reasoning about the performance and generalization capabilities of our solution. For illustration purposes, we chose two particular network scenarios and analyzed them extensively. As a baseline, we implemented the DRL-based system proposed in [15], a state-of-the-art solution for routing optimization in OTNs. Later on, in Section 6, we evaluated our solution on real-world network topologies and analyzed its scalability in terms of computation time and generalization capabilities.
Finding the optimal MDP solution to the OTN optimization problem is infeasible due to its complexity. Take as an example a small network topology with 6 nodes and 8 edges, where the links have capacities of 3 ODU0 units, there is only one bandwidth type available (1 ODU0) and there are 4 possible actions. The resulting number of states of the MDP is $5^{8} \cdot 6 \cdot 5 \cdot 1 \approx 1.17\mathrm{e}7$. To find a solution to the MDP we can use Dynamic Programming algorithms such as value iteration. However, this algorithm has a time complexity of $O(S^{2}A)$ to solve the MDP, where $S$ and $A$ are the number of states and actions respectively, and $S \approx O(N^{E})$, with $N$ being the number of different capacities a link can have and $E$ the number of links.
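The state count in the example above can be checked in a couple of lines, reading the reported figure as $N^{E}$ link-state combinations with $N=5$ and $E=8$, multiplied by the ordered source-destination pairs and the single bandwidth type of the current demand.

```python
N, E = 5, 8                  # capacity levels per link, number of links
src_dst_pairs = 6 * 5        # ordered node pairs in a 6-node topology
bandwidth_types = 1
print(N**E * src_dst_pairs * bandwidth_types)  # 11718750, i.e. ~1.17e7
```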