
Deep reinforcement learning meets graph neural networks: Exploring a routing optimization use case


Paul Almasan $^{a}$, José Suárez-Varela $^{a}$, Krzysztof Rusek $^{b,a}$, Pere Barlet-Ros $^{a}$, Albert Cabellos-Aparicio $^{a}$
$^{a}$ Barcelona Neural Networking Center, Universitat Politècnica de Catalunya, Barcelona, Spain
$^{b}$ Institute of Telecommunications, AGH University of Science and Technology, Krakow, Poland

ARTICLE INFO

Keywords:

Graph neural networks
Deep reinforcement learning
Routing
Optimization

Abstract

Deep Reinforcement Learning (DRL) has shown a dramatic improvement in decision-making and automated control problems. Consequently, DRL represents a promising technique to efficiently solve many relevant optimization problems (e.g., routing) in self-driving networks. However, existing DRL-based solutions applied to networking fail to generalize, which means that they are not able to operate properly when applied to network topologies not observed during training. This lack of generalization capability significantly hinders the deployment of DRL technologies in production networks. This is because state-of-the-art DRL-based networking solutions use standard neural networks (e.g., fully connected, convolutional), which are not suited to learn from information structured as graphs.

In this paper, we integrate Graph Neural Networks (GNN) into DRL agents and we design a problem-specific action space to enable generalization. GNNs are Deep Learning models inherently designed to generalize over graphs of different sizes and structures. This allows the proposed GNN-based DRL agent to learn and generalize over arbitrary network topologies. We test our DRL+GNN agent in a routing optimization use case in optical networks and evaluate it on 180 and 232 unseen synthetic and real-world network topologies, respectively. The results show that the DRL+GNN agent is able to outperform state-of-the-art solutions in topologies never seen during training.

1. Introduction

In recent years, industrial advances (e.g., Industry 4.0, IoT) and changes in social behavior have created a proliferation of modern network applications (e.g., Vehicular Networks, AR/VR, Real-Time Communications), imposing new requirements on backbone networks (e.g., high throughput and low latency). Consequently, network operators need to efficiently manage the network resources, ensuring the customers' Quality of Service and fulfilling the Service Level Agreements. This is typically done using expert knowledge or solvers leveraging Integer Linear Programming (ILP) or Constraint Programming (CP). However, real-world production networks have on the order of hundreds of nodes, and solvers based on ILP or CP would take a large amount of time to solve network optimization problems [1,2]. In addition, heuristic-based solutions are far from optimal.
Deep Reinforcement Learning (DRL) has shown significant improvements in sequential decision-making and automated control problems [3,4]. As a result, the network community is already investigating DRL as a key technology for network optimization (e.g., routing) with
the goal of enabling self-driving networks [5-8]. However, existing DRL-based solutions still fail to generalize when applied to different network scenarios [9,10]. In this context, generalization refers to the ability of the DRL agent to adapt to new network scenarios not seen during training (e.g., network topologies, configurations).
We argue that generalization is an essential property for the successful adoption of DRL technologies in production networks. Without generalization, DRL solutions would have to be trained in the same network where they are deployed, which in general is neither possible nor affordable. Training a DRL agent is a very costly and lengthy process. It often requires significant computing power and instrumentation of the network to observe its performance (e.g., delay, jitter). Additionally, decisions made by a DRL agent during training can lead to degraded performance or even to service disruption. Thus, training a DRL agent in the customer's network may be unfeasible.
With generalization, a DRL agent can be trained with multiple, representative network topologies and configurations. Afterwards, it can be applied to other topologies and configurations, as long as they
share some common properties. Such a “universal” model can be trained in a laboratory and later on be incorporated in a product or a network device (e.g., router, load balancer). The resulting solution would be ready to be deployed to a production network without requiring any further training or instrumentation in the customer network.¹
Unfortunately, existing DRL proposals for networking were designed to operate in the same network topology seen during training [9,11,12], thereby limiting their potential deployment on production networks. The main reason behind this strong limitation is that computer networks are fundamentally represented as graphs. For instance, the network topology and routing policy are typically represented as such. However, state-of-the-art proposals [11,13-15] use traditional neural network (NN) architectures (e.g., fully connected, convolutional) that are not well suited to model graph-structured information [16].
In this paper, we integrate Graph Neural Networks (GNN) [17] into DRL agents to solve network optimization problems. Particularly, our architecture is intended to solve routing optimization in optical networks and to generalize over never-seen arbitrary topologies. The GNN integrated in our DRL agent is inspired by Message-passing Neural Networks (MPNN), which were successfully applied to solve a relevant chemistry-related problem [18]. In our case, the GNN was specifically designed to capture meaningful information about the relations between the links and the traffic flowing through the network topologies.
The evaluation results show that our agent achieves a strong generalization capability compared to state-of-the-art DRL (SoA DRL) algorithms [15]. Additionally, to further test the generalization capability of the proposed DRL-based architecture, we evaluated it on a set of 232 different real-world network topologies. The results show that the proposed DRL+GNN architecture is able to achieve outstanding performance over networks never seen during training. Finally, we explore the generalization limitations of our architecture and discuss its scalability properties.
Overall, our DRL+GNN architecture for network optimization has the following features:
• Generality: It can work effectively in network topologies and scenarios never seen during training.
• Deployability: It can be deployed to production networks without requiring training or instrumentation in the customer network.
• Low overhead: Once trained, the DRL agent can make routing decisions in only one step ($\approx$ ms), while its cost scales linearly with the network size.
• Commercialization: Network vendors can easily embed it in network devices or products, and successfully operate “arbitrary” networks.
We believe the combination of these features can enable the development of a new generation of networking solutions based on DRL that are more cost-effective than current approaches based on heuristics or linear optimization. All the topologies and scripts used in the experiments, as well as the source code of our DRL+GNN agent are publicly available [19].

2. Background

The solution proposed in this paper combines two machine learning mechanisms. First, we use a GNN to model computer network scenarios. GNNs are neural network architectures specifically designed to generalize over graph-structured data [16]. In addition, they offer near real-time operation in the scale of milliseconds (see Section 6.2). Second, we use Deep Reinforcement Learning to build an agent that learns
how to efficiently operate networks following a particular optimization goal. DRL applies the knowledge obtained in past optimizations to later decisions, without the necessity to run computationally intensive algorithms.

2.1. Graph Neural Networks

Graph Neural Networks are a novel family of neural networks designed to operate over graphs. They were introduced in [17] and numerous variants have been developed since [20,21]. In their basic form, they associate some initial states with the different elements of an input graph and combine them according to how these elements are connected in the graph. An iterative algorithm updates the elements' states and uses the resulting states to produce an output. The particularities of the problem to solve determine which GNN variant is more suitable, depending on, for instance, the nature of the graph elements (i.e., nodes and edges) involved.
Message Passing Neural Networks (MPNN) [18] are a well-known type of GNNs that apply an iterative message-passing algorithm to propagate information between the nodes of the graph. In a message-passing step, each node $k$ receives messages from all the nodes in its neighborhood, denoted by $N(k)$. Messages are generated by applying a message function $m(\cdot)$ to the hidden states of node pairs in the graph. Then, they are combined by an aggregation function, for instance, a sum (Eq. (1)). Finally, an update function $u(\cdot)$ is used to compute a new hidden state for every node (Eq. (2)).
M k t + 1 = i N ( k ) m ( h k t , h i t ) M k t + 1 = i N ( k ) m h k t , h i t M_(k)^(t+1)=sum_(i in N(k))m(h_(k)^(t),h_(i)^(t))M_{k}^{t+1}=\sum_{i \in N(k)} m\left(h_{k}^{t}, h_{i}^{t}\right)
h k t + 1 = u ( h k t , M k t + 1 ) h k t + 1 = u h k t , M k t + 1 h_(k)^(t+1)=u(h_(k)^(t),M_(k)^(t+1))h_{k}^{t+1}=u\left(h_{k}^{t}, M_{k}^{t+1}\right)
where functions $m(\cdot)$ and $u(\cdot)$ can be learned by neural networks. After a certain number of iterations, the final node states are used by a readout function $r(\cdot)$ to produce an output for the given task. This function can also be implemented by a neural network and is typically tasked to predict properties of individual nodes (e.g., the node's class) or global properties of the graph.
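As an illustration, the following minimal sketch implements one such message-passing step in plain Python/NumPy. The graph, the dimensions, and the single-layer stand-ins for $m(\cdot)$ and $u(\cdot)$ are illustrative assumptions; in the MPNN model both functions are learned neural networks.

    import numpy as np

    # Minimal sketch of one message-passing step (Eqs. (1)-(2)).
    rng = np.random.default_rng(0)
    num_nodes, dim = 5, 8
    neighbors = {0: [1, 2], 1: [0, 3], 2: [0, 3], 3: [1, 2, 4], 4: [3]}   # N(k)
    h = rng.normal(size=(num_nodes, dim))           # hidden states h_k^t
    W_m = rng.normal(size=(2 * dim, dim)) * 0.1     # stand-in weights for m(.)
    W_u = rng.normal(size=(2 * dim, dim)) * 0.1     # stand-in weights for u(.)

    def m(h_k, h_i):                                # message function m(.)
        return np.tanh(np.concatenate([h_k, h_i]) @ W_m)

    def u(h_k, M_k):                                # update function u(.)
        return np.tanh(np.concatenate([h_k, M_k]) @ W_u)

    # Eq. (1): aggregate the messages from the neighborhood of each node.
    M = np.stack([sum(m(h[k], h[i]) for i in neighbors[k]) for k in range(num_nodes)])
    # Eq. (2): update every hidden state with its aggregated messages.
    h = np.stack([u(h[k], M[k]) for k in range(num_nodes)])
    # Readout r(.): here, simply a sum over the final node states.
    graph_embedding = h.sum(axis=0)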
GNNs have been able to achieve relevant performance results in multiple domains where data is typically structured as a graph [18,22]. Since computer networks are fundamentally represented as graphs, GNNs inherently offer unique advantages for network modeling compared to traditional neural network architectures (e.g., fully connected NN, Convolutional NN, etc.).

2.2. Deep Reinforcement Learning

DRL algorithms aim at learning a long-term strategy that maximizes an objective function in an optimization problem. DRL agents start from a tabula rasa state and they learn the optimal strategy by an iterative process that explores the state and action spaces. These are denoted by a set of states $(\mathcal{S})$ and a set of actions $(\mathcal{A})$. Given a state $s \in \mathcal{S}$, the agent will perform an action $a \in \mathcal{A}$ that produces a transition to a new state $s^{\prime} \in \mathcal{S}$, and will provide the agent with a reward $r$. Then, the objective is to find a strategy that maximizes the cumulative reward by the end of an episode. The definition of the end of an episode depends on the optimization problem to address.
Q-learning [23] is an RL algorithm whose goal is to make an agent learn a policy $\pi: \mathcal{S} \rightarrow \mathcal{A}$. The algorithm creates a table (a.k.a., q-table) with all the possible combinations of states and actions. At the beginning of the training, the table is initialized (e.g., with zeros or random values) and during training, the agent updates these values according to the rewards obtained after selecting an action. These values, called q-values, represent the expected cumulative reward after applying action $a$ from state $s$, assuming that the agent follows the current policy $\pi$ during the rest of the episode. During training, q-values are updated using the Bellman equation (see Eq. (3)), where $Q(s_{t}, a_{t})$ is the q-value function at time-step $t$, $\alpha$ is the learning rate, $r(s_{t}, a_{t})$ is the
reward obtained from selecting action $a_{t}$ from state $s_{t}$, and $\gamma \in [0,1]$ is the discount factor.
$$Q\left(s_{t}, a_{t}\right)=Q\left(s_{t}, a_{t}\right)+\alpha\left(r\left(s_{t}, a_{t}\right)+\gamma \max_{a^{\prime}} Q\left(s_{t}^{\prime}, a^{\prime}\right)-Q\left(s_{t}, a_{t}\right)\right) \tag{3}$$
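For reference, the update in Eq. (3) can be sketched in tabular form as follows; the state and action encodings, the learning rate, and the discount factor are placeholders, not the OTN formulation used later in this paper.

    from collections import defaultdict

    # Tabular Q-learning update implementing Eq. (3).
    alpha, gamma = 0.1, 0.95
    q_table = defaultdict(float)          # Q(s, a), initialized to zero

    def q_update(s, a, r, s_next, actions):
        # max over a' of Q(s', a'), then move Q(s, a) towards the target.
        best_next = max(q_table[(s_next, a2)] for a2 in actions)
        q_table[(s, a)] += alpha * (r + gamma * best_next - q_table[(s, a)])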
Deep Q-Network (DQN) [24] is a more advanced algorithm based on Q-learning that uses a Deep Neural Network (DNN) to approximate the q-value function. As the q-table size becomes larger, Q-learning has difficulty learning a policy from high-dimensional state and action spaces. To overcome this problem, the authors proposed to use a DNN as a q-value function estimator, relying on the generalization capabilities of DNNs to estimate the q-values of states and actions unseen in advance. For this reason, an NN well suited to understand and generalize over the input data of the DRL agent is crucial for the agent to perform well when facing states (or environments) never seen before. Additionally, DQN uses an experience replay buffer to store past sequential experiences (i.e., it stores tuples of $\{s, a, r, s^{\prime}\}$), as sketched below.
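A minimal sketch of such an experience replay buffer, implemented as a FIFO queue of (s, a, r, s') tuples from which random mini-batches are sampled, is shown below; the class itself is illustrative, and the default capacity and batch size follow the values later reported in Section 5.1.

    import random
    from collections import deque

    class ReplayBuffer:
        """FIFO buffer of past experiences used to train the q-value estimator."""
        def __init__(self, capacity=4000):
            self.buffer = deque(maxlen=capacity)   # oldest samples are dropped first

        def store(self, s, a, r, s_next):
            self.buffer.append((s, a, r, s_next))

        def sample(self, batch_size=32):
            return random.sample(self.buffer, min(batch_size, len(self.buffer)))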

3. Network optimization scenario

In this paper, we explore the potential of a GNN-based DRL agent to address the routing problem in Optical Transport Networks (OTN). Particularly, we consider a network scenario based on Software-Defined Networking, where the DRL agent (located in the control plane) has a global view of the current network state, and has to make routing decisions on every traffic demand as it arrives. We consider a traffic demand as the volume of traffic sent from a source to a destination node. This is a relevant optimization scenario that has been studied in the last decades in the optical networking community, where many solutions have been proposed [11,15,25].
In our OTN scenario, the DRL agent makes routing decisions at the electrical domain, over a logical topology where nodes represent Reconfigurable Optical Add-Drop Multiplexers (ROADM) and edges are predefined lightpaths connecting them (see Fig. 1). The DRL agent receives traffic demands with different bandwidth requirements defined by the tuple $\{src, dst, bandwidth\}$, and it has to select an end-to-end path for every demand. Particularly, end-to-end paths are defined as sequences of lightpaths connecting the source and destination of a demand. Since the agent operates at the electrical domain, traffic demands are defined as requests of Optical Data Units (ODUk), whose bandwidth requirements are defined in the ITU-T Recommendation G.709 [26]. The ODUk signals are then multiplexed into Optical Transport Units (OTUk), which are data frames including Forward Error Correction. Eventually, OTUk frames are mapped to different optical channels within the lightpaths of the topology.
In this scenario, the routing problem is defined as finding the optimal routing policy for each incoming source-destination traffic demand. The learning process is guided by an objective function that aims to maximize the traffic volume allocated in the network in the long-term. We consider that a demand is properly allocated if there is enough available capacity in all the lightpaths forming the end-to-end path selected. Note that lightpaths are the edges in the logical topology where the agent operates. The demands do not expire, occupying the lightpaths until the end of a DRL episode. This implies a challenging task for the agent, since it has not only to identify critical resources on networks (e.g., potential bottlenecks), but also to deal with the uncertainty in the generation of future traffic demands. The following constraints summarize the traffic demand routing problem in the OTN scenario:
• The agent must make sequential routing decisions for every incoming traffic demand
• Traffic demands cannot be split over multiple paths
• Previous traffic demands cannot be rerouted and they occupy the links' capacities until the end of the episode
Fig. 1. Schematic representation of the DRL agent in the OTN routing scenario.
The optimal solution to the OTN optimization problem can be found by solving its Markov Decision Process (MDP) [27]. To do this, we can use techniques such as Dynamic Programming, which consist of an iterative process over all the MDP's states until convergence. The MDP for the traffic demand allocation problem consists of all the possible network topology states and the transition probabilities between states. Notice that in our scenario we have uniform transition probabilities from one state to the next. One limitation of solving MDPs optimally is that it becomes infeasible for large and complex optimization problems. As the problem size grows, so does the MDP's state space, whose space complexity (in number of states) is $S \approx O(N^{E})$, where $N$ is the number of different capacities a link can have and $E$ is the number of links. Therefore, solving the MDP requires increasingly more time to iterate over all its states.

4. GNN-based DRL agent design

In this section, we describe the DRL+GNN architecture proposed in this paper. On one side, we have the GNN-based DRL agent, which defines the actions to apply on the network topology. These actions consist of allocating the demands on one of the candidate paths. Our DRL agent implements the DQN algorithm [24], where the $q$-value function is modeled by a GNN. On the other side, we have an environment which defines the optimization problem to solve. This environment stores the network topology, together with the link features. In addition, the environment is responsible for generating the reward once an action is performed, which indicates to the agent whether that action was good or not.
The learning process is based on an iterative process, where at each time step, the agent receives a graph-structured network state observation from the environment. Then, the GNN constructs a graph representation where the links of the topology are the graph entities. In this representation, the link hidden states are initialized considering the input link-level features and the routing action to evaluate (see more details in Sections 4.1-4.3). With this representation, an iterative message passing algorithm runs between the links' hidden states according to the graph structure. The output of this algorithm (i.e., the new link hidden states) is aggregated into a global hidden state that encodes topology information, and then is processed by a DNN. This process makes the GNN topology invariant, because the global hidden state has a pre-defined length and will always have the same length for different topology sizes. At the end of the message passing phase, the GNN outputs a $q$-value estimate. This $q$-value is evaluated over a limited set of actions, and finally the DRL agent selects the action with the highest $q$-value.

4.1. Environment

The network state is defined by the topology links' features, which include the link capacity and the link betweenness.
Fig. 2. Action representation in the link hidden states.
The former indicates the amount of capacity available on the link. The latter is a measure of centrality inherited from graph theory that indicates how many paths may potentially traverse the link. From the experimental results we observed that this feature helps reduce the grid search for the hyperparameter tuning of the DRL agent, since the betweenness helps the agent converge faster to a good policy. In particular, we compute the link betweenness in the following way: for each pair of nodes in the topology, we compute $k$ candidate paths (e.g., the $k$ shortest paths), and we maintain a per-link counter that indicates how many paths pass through the link. Thus, the betweenness of each link is the number of end-to-end paths crossing the link divided by the total number of paths.
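A minimal sketch of this computation is shown below, using the NetworkX library (also used for topology generation in Section 6.1.1) and assuming an undirected topology; the function name is ours, not the public code's.

    import itertools
    import networkx as nx

    def link_betweenness(graph, k=4):
        """Fraction of the k shortest paths (over all node pairs) crossing each link."""
        counts = {frozenset(e): 0 for e in graph.edges()}
        total_paths = 0
        for src, dst in itertools.permutations(graph.nodes(), 2):
            k_paths = itertools.islice(nx.shortest_simple_paths(graph, src, dst), k)
            for path in k_paths:
                total_paths += 1
                for u, v in zip(path[:-1], path[1:]):
                    counts[frozenset((u, v))] += 1
        return {edge: c / total_paths for edge, c in counts.items()}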

4.2. Action space

In this section we describe how the routing actions are represented in the DRL+GNN agent. Note that the number of possible routing combinations for each source-destination node pair typically results in a high-dimensional action space in large-scale real-world networks. This makes the routing problem complex for the DRL agent, since it should estimate the q-values for all the possible actions to apply (i.e., routing configurations). To overcome this problem, the action space must be carefully designed to reduce the dimensionality. In addition, to enable generalization to other topologies, the action space should be equivalent across topologies. In other words, if the actions in the training topology are represented by shortest paths, in the evaluation topology they should also be shortest paths. If the action space were different (e.g., multiple paths between a source-destination node pair), the agent would have problems learning and would not generalize well. To leverage the generalization capability of GNNs, we introduce the action into the agent in the form of a graph. This makes the action representation invariant to node and edge permutation, which means that, once the GNN is successfully trained, it is able to understand actions over arbitrary graph structures (i.e., over different network states and topologies).
Considering the above, we limit the action set to $k$ candidate paths for each source-destination node pair. To maintain a good trade-off between flexibility to route traffic and the cost to evaluate all the possible actions, we selected a set with the $k=4$ shortest paths (by number of hops) for each source-destination node pair. This follows a criterion originally proposed in [15]. Note that the action set differs depending on the source and destination nodes of the traffic demand to be routed.
To represent the action, we introduce it within the network state. Particularly, we consider an additional link-level feature, which is the bandwidth allocated over the link after applying the routing action. This value corresponds to the bandwidth demand of the current traffic request to be allocated. Likewise, the links that are not included in the path selected by the action will have this feature set to zero.
Table 1
Input features of the link hidden states. $N$ corresponds to the size of the hidden state vector.
| Notation | Description |
| :--- | :--- |
| $x_{1}$ | Link available capacity |
| $x_{2}$ | Link betweenness |
| $x_{3}$ | Action vector (bandwidth allocated) |
| $x_{4}-x_{N}$ | Zero padding |
Since our OTN environment has a limited number of traffic requests with various discrete bandwidth demands, we represent the bandwidth allocated with an $N$-element one-hot encoding vector, where $N$ is the vector length.
Fig. 2 illustrates the representation of the action in the hidden state of the links in a simple network scenario. A traffic request from node 1 to node 5, with a traffic demand of 8 bandwidth units, is allocated over the path formed by the nodes $\{1,2,3,5\}$. To summarize, Table 1 provides a description of the features included in the links' hidden states. These values represent both the network state and the action, which is the input needed to model the $q$-value function $Q(s, a)$.
The size of the hidden states is typically larger than the number of features they carry. This is to enable each link to store information about itself (i.e., its own initial features) plus the aggregated information coming from all of its neighbors (see Section 4.3). If the hidden state size were equal to the number of link features, the links would not have space to store information about the neighboring links without losing information. This results in a poor graph embedding after the readout function. On the contrary, if the state size is very large, it can lead to a large GNN model, which can overfit to the data. A common approach is to set the state size larger than the number of features and to fill the rest of the vector with zeros.
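To make the layout of Table 1 concrete, the sketch below builds a link hidden state with the 27-element size reported in Section 5.1; the 3-element one-hot encoding of the ODUk bandwidths (8, 32 and 64 ODU0 units) and the numeric feature values are illustrative assumptions.

    import numpy as np

    HIDDEN_SIZE = 27                                     # size used in Section 5.1
    ODU_ONE_HOT = {0: [0, 0, 0], 8: [1, 0, 0], 32: [0, 1, 0], 64: [0, 0, 1]}

    def link_hidden_state(capacity, betweenness, allocated_bw):
        """Build [x1, x2, x3..., zero padding] as in Table 1."""
        features = [capacity, betweenness] + ODU_ONE_HOT[allocated_bw]
        return np.array(features + [0.0] * (HIDDEN_SIZE - len(features)))

    # In the spirit of Fig. 2: links on the selected path carry the 8-unit
    # demand in the action feature; the remaining links carry zeros there.
    on_path_link = link_hidden_state(capacity=192, betweenness=0.12, allocated_bw=8)
    off_path_link = link_hidden_state(capacity=200, betweenness=0.08, allocated_bw=0)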

4.3. GNN architecture

The GNN model is based on the Message Passing Neural Network [18] model. In our case, we consider the link entity and perform the message passing process between all links. We choose link entities, instead of node entities, because the link features are what define the OTN routing optimization problem. Node entities could be added when addressing an optimization problem that needs to incorporate node-level features (e.g., I/O buffer size, scheduling algorithm). Algorithm 1 shows a formal description of the message passing process, where the algorithm receives as input the links' features $(x_{l})$ and outputs a $q$-value $(q)$.
The algorithm performs $T$ message passing steps. A graphical representation can be seen in Fig. 3, where the algorithm iterates over all the links of the network topology.
Fig. 3. Message passing architecture.
Algorithm 1 Message Passing
    Input: x_l
    Output: h_l^T, q
    for each l ∈ L do
        h_l^0 ← [x_l, 0, ..., 0]
    for t = 1 to T do
        for each l ∈ L do
            M_l^{t+1} = Σ_{i ∈ N(l)} m(h_l^t, h_i^t)
            h_l^{t+1} = u(h_l^t, M_l^{t+1})
    rdt ← Σ_{l ∈ L} h_l
    q ← R(rdt)
For each link, its features are combined with those of its neighboring links using a fully-connected NN, corresponding to M in Fig. 3. The outputs of these operations are called messages according to the GNN notation. Then, the messages computed for each link with its neighbors are aggregated using an element-wise sum (line 5 in Algorithm 1). Afterwards, a Recurrent NN (RNN) is used to update the link hidden states $h_{l}$ with the new aggregated information (line 6 in Algorithm 1). At the end of the message passing phase, the resulting link states are aggregated using an element-wise sum (line 7 in Algorithm 1). The result is passed through a fully-connected DNN which models the readout function of the GNN. The output of this latter function is the estimated $q$-value of the input state and action.
The role of the RNN is to learn how the link states change along the message passing phase. As the link information is spread through the graph, each hidden state stores information from links that are farther and farther apart; hence the notion of time appears. RNNs are an NN architecture tailored to capture sequential behavior (e.g., text, video, time series). In addition, some RNN architectures (e.g., GRU) are designed to process large sequences (e.g., long text sentences in NLP). Specifically, they internally contain gates designed to mitigate vanishing gradients, a common problem with large sequences [28]. This makes RNNs suitable to learn how the links' states evolve during the message passing phase, even for large $T$.
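As a sketch of this design choice, the update function u(·) of Algorithm 1 can be realized with a GRU cell that takes the aggregated messages as input and the current link states as its recurrent state. The snippet below assumes TensorFlow (used in Section 5.1) and illustrative sizes; it is not the public implementation.

    import tensorflow as tf

    hidden_size, num_links = 27, 10
    update_cell = tf.keras.layers.GRUCell(hidden_size)            # learned update u(.)

    link_states = tf.zeros([num_links, hidden_size])              # h_l^t
    aggregated_msgs = tf.random.normal([num_links, hidden_size])  # M_l^{t+1}

    # One message-passing step: the aggregated messages are the GRU input and
    # the current link states are the recurrent state (line 6 of Algorithm 1).
    _, new_states = update_cell(aggregated_msgs, [link_states])
    link_states = new_states[0]                                   # h_l^{t+1}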

4.4. DRL agent operation

The DRL agent operates by interacting with the environment. In Algorithm 2 we can observe a pseudocode describing the DRL agent operation. At the beginning, we initialize the environment env by initializing all the link features. At the same time, the environment generates a traffic demand to be allocated, given by the tuple $\{src, dst, bw\}$, and an environment state $s$. We also initialize the cumulative reward to zero, define the action set size and create the experience replay buffer (agt.mem). Afterwards, we execute a while loop (lines 3-16) that finishes when there is some demand that cannot be allocated in the network topology. For each of the $k=4$ shortest paths, we allocate the demand along all the links forming the path and compute a $q$-value (lines 7-9). Once we have the $q$-value for each state-action pair, the next action $a$ to apply is selected using an $\epsilon$-greedy exploration strategy (line 10) [24]. The action is then applied to the environment, leading to a new state $s^{\prime}$, a reward $r$ and a flag Done indicating if there is some link without enough capacity to support the demand. Additionally, the environment returns a new traffic demand tuple $\{src^{\prime}, dst^{\prime}, bw^{\prime}\}$. The information about the state transition is stored in the experience replay buffer (line 13). This information will be used later on to train the GNN in the agt.replay() call (line 15), which is executed every $M$ training iterations.
Algorithm 2 DRL Agent operation
    s, src, dst, bw ← env.init_env()
    reward ← 0, k ← 4, agt.mem ← {}, Done ← False
    while not Done do
        k_q_values ← {}
        k_shortest_paths ← compute_k_paths(k, src, dst)
        for i in 0, ..., k do
            p' ← get_path(i, k_shortest_paths)
            s' ← env.alloc_demand(s, p', src, dst, bw)
            k_q_values[i] ← compute_q_value(s', p')
        q_value ← epsilon_greedy(k_q_values, ε)
        a ← get_action(q_value, k_shortest_paths, s)
        r, Done, s', src', dst', bw' ← env.step(s, a)
        agt.rmb(s, src, dst, bw, a, r, s', src', dst', bw')
        reward ← reward + r
        if training_steps % M = 0: agt.replay()
        src ← src'; dst ← dst'; bw ← bw'; s ← s'

5. Experimental results

In this section we evaluate our GNN-based DRL agent to optimize the routing configuration in the OTN scenario described in Section 3. Particularly, the experiments in this section are focused on evaluating the performance and generalization capabilities of the proposed DRL+GNN agent. Afterwards, in Section 6, we analyze the scalability properties of our solution and discuss other relevant aspects related to the deployment on production networks.

5.1. Evaluation setup

We implemented the DRL+GNN solution described in Section 4 with TensorFlow [29] and evaluated it on an OTN network simulator implemented using the OpenAI Gym framework [30]. The source code, together with all the training and evaluation results, is publicly available [19].
In the OTN simulator, we consider three traffic demand types (ODU2, ODU3, and ODU4), whose bandwidth requirements are expressed in terms of multiples of ODU0 signals (i.e., 8, 32, and 64 ODU0 bandwidth units, respectively) [26]. When the DRL agent correctly allocates a demand, it receives an immediate reward equal to the bandwidth of the current traffic demand; otherwise the reward is 0. We consider that a demand is successfully allocated if all the links in the path selected by the DRL agent have enough available capacity to carry such demand. Likewise, episodes end when a traffic demand cannot be correctly allocated. Traffic demands are generated by uniformly selecting a source-destination node pair and a traffic demand type (ODUk). This makes the problem even more difficult for the DRL agent, since the uniform traffic distribution hinders the exploitation of prediction systems to anticipate demands that may be difficult to allocate. In other words, all traffic demands are equally probable to appear in the future, making it more difficult for the DRL agent to estimate the expected future rewards.
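A minimal sketch of these environment dynamics (uniform demand generation and the allocation/reward rule) could look as follows; the function and variable names are illustrative and do not correspond to the public implementation [19].

    import random

    ODU_BANDWIDTH = {"ODU2": 8, "ODU3": 32, "ODU4": 64}   # in ODU0 units

    def generate_demand(nodes):
        """Uniformly pick a source-destination pair and a demand type."""
        src, dst = random.sample(list(nodes), 2)
        return src, dst, random.choice(list(ODU_BANDWIDTH.values()))

    def env_step(link_capacity, path, bw):
        """Allocate bw on every link of the chosen path. The reward equals the
        allocated bandwidth, or 0 (and the episode ends) if a link lacks capacity."""
        links = [frozenset(l) for l in zip(path[:-1], path[1:])]
        if any(link_capacity[l] < bw for l in links):
            return 0, True                                  # reward, Done
        for l in links:
            link_capacity[l] -= bw
        return bw, False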
Initial experiments were carried out to choose an appropriate gradient-based optimization algorithm and to find the hyperparameter values for the DRL+GNN agent. For the GNN model, we defined the links' hidden states $h_{l}$ as 27-element vectors (filled with the features described in Table 1). Note that the size of the hidden states is related to the amount of information they may potentially encode. Larger network topologies and complex network optimization scenarios might need larger sizes for the hidden state vectors. In every forward propagation of the GNN we execute $T=7$ message passing steps using batches of 32 samples. The optimizer used is Stochastic Gradient Descent [31] with a learning rate of $10^{-4}$ and a momentum of 0.9. We start the
$\epsilon$-greedy exploration strategy with $\epsilon=1.0$ and maintain this value during 70 training iterations. Afterwards, $\epsilon$ decays exponentially every episode. The experience buffer stores 4000 samples and is implemented as a FIFO queue (first in, first out). We applied $l_{2}$ regularization and dropout to the readout function with a coefficient of 0.1 in both cases. The discount factor $\gamma$ was set to 0.95.
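For convenience, the hyperparameters above can be collected in a configuration sketch; the variable names are ours, not those of the public code, and the optimizer call assumes the TensorFlow Keras API.

    import tensorflow as tf

    # Hyperparameters reported in this section, gathered in one place.
    HPARAMS = dict(hidden_state_size=27, message_passing_steps=7, batch_size=32,
                   learning_rate=1e-4, momentum=0.9, epsilon_start=1.0,
                   epsilon_constant_iters=70, replay_buffer_size=4000,
                   l2_coefficient=0.1, dropout_rate=0.1, gamma=0.95)

    optimizer = tf.keras.optimizers.SGD(learning_rate=HPARAMS["learning_rate"],
                                        momentum=HPARAMS["momentum"])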

5.2. Methodology

We divided the evaluation of our DRL+GNN agent in two sets of experiments. In the first set, we focused on reasoning about the performance and generalization capabilities of our solution. For illustration purposes, we chose two particular network scenarios and analyzed them extensively. As a baseline, we implemented the DRL-based system proposed in [15], a state-of-the-art solution for routing optimization in OTNs. Later on, in Section 6, we evaluated our solution on real-world network topologies and analyzed its scalability in terms of computation time and generalization capabilities.
Finding the optimal MDP solution to the OTN optimization problem is infeasible due to its complexity. Take as an example a small network topology with 6 nodes and 8 edges, where the links have capacities of 3 ODU0 units, there is only one bandwidth type available (1 ODU0) and there are 4 possible actions. The resulting number of states of the MDP is $5^{8} \cdot 6 \cdot 5 \cdot 1 \approx 1.17\mathrm{e}7$. To find a solution to the MDP we can use Dynamic Programming algorithms such as value iteration. However, this algorithm has a time complexity to solve the MDP of $O(S^{2}A)$, where $S$ and $A$ are the number of states and actions respectively and $S \approx O(N^{E})$, with $N$ the number of different capacities a link can have and $E$ the number of links.
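Under this example (assuming 5 capacity levels per link, 6 possible source nodes, 5 possible destination nodes and 1 bandwidth type), the numbers can be checked with a quick computation:

    num_states = 5 ** 8 * 6 * 5 * 1          # 11,718,750 ≈ 1.17e7 MDP states
    num_actions = 4
    value_iteration_cost = num_states ** 2 * num_actions   # O(S^2 * A) ≈ 5.5e14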
As an alternative, we compare the DRL+GNN agent performance with a theoretical fluid model (labeled as Theoretical Fluid). This model is a theoretical approach which considers that traffic demands may be split over the $k=4$ candidate paths proportionally to the available capacity they have. This routing policy is aimed at avoiding congestion on links. For instance, paths with low available capacity will carry a small proportion of the traffic volume from new demands. Note that this model is non-realizable because ODU demands cannot be split in real OTN scenarios. However, it is fast to compute and serves as a reference to compare the performance of the DRL+GNN agent. In addition, we also use a load balancing routing policy (LB), which selects uniformly at random one path among the $k=4$ candidate shortest paths to allocate the traffic demand.
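As a sketch, the fluid baseline splits a demand over the candidate paths proportionally to their available capacity; the function below and the numeric example are illustrative.

    def fluid_split(demand_bw, path_capacities):
        """Theoretical fluid allocation: the demand is split proportionally to
        the available capacity of each candidate path (non-realizable for ODUk)."""
        total = sum(path_capacities)
        return [demand_bw * c / total for c in path_capacities]

    # e.g., a 32-unit demand over paths with 100, 50, 30 and 20 free ODU0 units
    shares = fluid_split(32, [100, 50, 30, 20])   # -> [16.0, 8.0, 4.8, 3.2]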
We trained the DRL+GNN agent in an OTN routing scenario on the 14-node Nsfnet topology [32], where we considered that the links represent lightpaths with capacity for 200 ODU0 signals. Note that the capacity is shared in both directions of the links and that the bandwidth of different traffic demands is expressed in multiples of ODU0 signals (i.e., 8, 32 or 64 ODU0 bandwidth units). We ran 1000 training iterations where the agent received traffic demands and allocated them on one of the $k=4$ shortest paths available in the action set. The model with highest performance was selected to be benchmarked against traditional routing optimization strategies and state-of-the-art DRL-based solutions.

5.3. Performance evaluation against state-of-the-art DRL-based solutions

In this evaluation experiment, we compare our DRL+GNN agent against state-of-the-art DRL-based solutions. Particularly, we adapted the solution proposed in [15] to operate in scenarios where links share their capacity in both directions. We trained two different instances of the state-of-the-art DRL agent in two network scenarios: the 14-node Nsfnet and the 24-node Geant2 topology [33]. We ran 1000 experiments with uniform traffic generation to provide representative results. Note that both the proposed DRL+GNN agent and the state-of-the-art DRL solution were evaluated over the same list of generated demands.
We ran two experiments to compare the performance of our DRL+GNN with the results obtained by the state-of-the-art DRL (SoA DRL). In the first experiment, we evaluated the DRL+GNN agent against the SoA DRL agent trained on Nsfnet, the LB routing policy, and the theoretical fluid model. We evaluated the four routing strategies on the Nsfnet topology and compared their performance. In Fig. 4(a), we can observe a boxplot with the results of 1000 evaluation experiments. The $y$-axis indicates the agent score, which corresponds to the bandwidth allocated by the agent. Fig. 4(c) shows the Cumulative Distribution Function (CDF) of the relative score obtained with respect to the fluid model. In this experiment we could also observe that the proposed DRL+GNN agent slightly outperforms the SoA DRL-based agent by allocating 6.6% more bandwidth. In the second experiment, we evaluated the same models (DRL+GNN, SoA DRL, LB, and Theoretical Fluid) on the Geant2 topology, but in this case the SoA DRL agent was trained on Geant2. The resulting boxplot can be seen in Fig. 4(b) and the CDF of the evaluation samples in Fig. 4(d). Similarly, in this case our agent performs slightly better than the SoA DRL approach (3% more bandwidth).
We ran another experiment to compare the generalization capabilities of our DRL+GNN agent. In this experiment, we evaluated the DRL+GNN agent (trained on Nsfnet) against the SoA DRL agent trained on Nsfnet, and evaluated both agents on the Geant2 topology. The resulting boxplot can be seen in Fig. 5(a) and the corresponding CDF in Fig. 5(b). The results indicate that in this scenario the DRL+GNN agent also outperforms the SoA DRL agent. In this case, in 80% of the experiments our DRL+GNN agent achieved more than 45% performance improvement with respect to the SoA DRL proposal. These results show that while the proposed DRL+GNN agent is able to generalize and achieve outstanding performance in the unseen Geant2 topology (Figs. 5(a) and 5(b)), the SoA DRL agent performs poorly when applied to topologies not seen during training. This reveals the lack of generalization capability of the latter DRL-based solution compared to the agent proposed in this paper.
This subsection presents a use case where we evaluate if our DRL+GNN agent can adapt successfully to changes in the network topology. For this, we consider the case of a network with link failures. Previous work showed that real-world network topologies change during time (e.g., due to link failures) [1,34,35]. These changes in network connectivity are unpredictable and they have a significant impact in protocol convergence [34] or on fulfilling network optimization goals [1].
In this evaluation, we considered a range of scenarios that can experience up to 10 link failures. Thus, the DRL+GNN agent is tasked to find new routing configurations that avoid the affected links while still maximizing the total bandwidth allocated. We executed experiments where $n \in [1,10]$ links are randomly removed from the Geant2 topology. We compare the score (i.e., bandwidth allocated) achieved by the DRL+GNN agent with respect to the theoretical fluid model. Fig. 6(a) shows the average score over 1000 experiments (y-axis) as a function of the number of link failures (x-axis). There, we can observe that the DRL+GNN agent can maintain better performance than the theoretical baseline even in the extreme case of 10 concurrent link failures. Likewise, Fig. 6(b) shows the relative score of our DRL+GNN agent against the theoretical fluid model. In line with the previous results, the relative score is maintained as links are removed from the topology. This suggests that the proposed DRL+GNN architecture is able to adapt to topology changes.
Fig. 4. Performance evaluation against state-of-the-art DRL. Notice that the vertical lines in 4(c) and 4(d) indicate the same performance as the theoretical fluid model.
Fig. 5. Evaluation on Geant2 of DRL-based solutions trained on Nsfnet.

6. Discussion on deployment

In this section we analyze and discuss relevant aspects of our DRL+GNN architecture towards deployment in production networks. In the context of self-driving networks, DRL cannot succeed without generalization capabilities. Training a DRL agent requires instrumenting the network with configurations that may disrupt the service. As a result, training in the customer’s network may be unfeasible. With generalization capabilities, the DRL agent can be trained in a controlled lab (for instance at the vendor’s facilities) and shipped to the customer. Once deployed, it can operate efficiently in an unseen network or
scenario. Fig. 7 illustrates this training and deployment process of a product based on our DRL+GNN architecture.
To better understand the technical feasibility and scalability properties of such a product in terms of cost and generalization, we designed two experiments. First, we analyze how the effectiveness of our agent scales with the network size, by training it in a single (small) network and evaluating its performance in synthetic and real-world network topologies. Second, we analyze the scalability of our agent in terms of computation time after deployment (i.e., the time it takes for the agent to make routing decisions). This is particularly relevant in real-time networking scenarios.
Fig. 6. DRL+GNN evaluation on a use case with link failures.
Fig. 7. Overview of the deployment process when incorporating DRL+GNN into a product.

6.1. Generalization over network topologies

In both experiments we used the DRL+GNN agent trained on a single topology (the 14-node Nsfnet) and analyzed its performance in larger topologies (up to 100 nodes) not seen during training.

6.1.1. Synthetic topologies

In this experiment we generated a total of 180 synthetic topologies with an increasing number of nodes. For each topology size - in number of nodes - we generated 20 topologies and we evaluated the agent on 1000 episodes. To do this, we used the NetworkX python library [36] to generate random network topologies between 20 and 100 nodes with similar average node degree to Nsfnet. This allows us to analyze how the network size affects the performance.
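As an illustration, the sketch below generates such topologies with NetworkX; the specific G(n, m) generator and the resampling loop used to enforce connectivity are assumptions, as the only fixed requirement is a target average node degree similar to Nsfnet's (≈3).

```python
import networkx as nx

def generate_topology(num_nodes: int, avg_degree: float = 3.0, seed: int = 0) -> nx.Graph:
    """Sample a connected random topology with a target average node degree."""
    num_edges = int(round(num_nodes * avg_degree / 2))  # |E| = n * <k> / 2
    while True:
        g = nx.gnm_random_graph(num_nodes, num_edges, seed=seed)
        if nx.is_connected(g):
            return g
        seed += 1  # resample until the graph is connected

# Example: 20 topologies per size, for sizes between 20 and 100 nodes.
topologies = {n: [generate_topology(n, seed=s) for s in range(20)]
              for n in range(20, 101, 10)}
```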
Fig. 8(a) shows how the performance decreases as the topology size grows. For benchmarking purposes, we computed the relative score with respect to the theoretical fluid model. The agent shows a remarkable performance in unseen topologies. As an example, the agent performs similarly to the theoretical fluid model in the 30-node topologies, which are more than double the size of the single 14-node topology seen during training. In addition, in the 100-node topologies, we observe only a 15% drop in performance. This result shows that the generalization properties of our solution degrade gracefully with the size of the network. It is well known that deep learning models lose generalization capability as the distribution of the data seen during training differs from that of the evaluation samples (see Section 6.3).

6.1.2. Real-world network topologies

In this section we evaluate the generalization capabilities of our DRL+GNN agent, trained in Nsfnet, on 232 real-world topologies obtained from the Topology Zoo [2] dataset. Specifically, we take all the topologies that have up to 100 nodes. In Table 2 we can see the features extracted from the resulting topologies. The diameter feature corresponds to the maximum eccentricity (i.e., the maximum distance from one node to another node). The ranges of the different topology features indicate that our topology dataset contains different topology distributions.

Table 2
Real-world topology features (minimum and maximum values).

| Feature | Minimum | Maximum |
| :--- | :--- | :--- |
| Num. nodes | 6 | 92 |
| Num. edges | 5 | 101 |
| Avg. node degree | 1.667 | 8 |
| Var. node degree | 0.001 | 41.415 |
| Diameter | 1 | 31 |
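As a sketch of how the features in Table 2 can be extracted, the snippet below loads Topology Zoo graphs with NetworkX; the GraphML file paths, the conversion to a simple undirected graph, and the connectivity filter are assumptions about the preprocessing.

```python
import glob
from statistics import mean, pvariance

import networkx as nx

def topology_features(g: nx.Graph) -> dict:
    """Features reported in Table 2 for a single topology."""
    degrees = [d for _, d in g.degree()]
    return {
        "num_nodes": g.number_of_nodes(),
        "num_edges": g.number_of_edges(),
        "avg_node_degree": mean(degrees),
        "var_node_degree": pvariance(degrees),
        # Diameter = maximum eccentricity over all nodes.
        "diameter": nx.diameter(g),
    }

# Hypothetical loading of Topology Zoo GraphML files, keeping graphs with
# up to 100 nodes and a single connected component.
features = []
for path in glob.glob("topology_zoo/*.graphml"):
    g = nx.Graph(nx.read_graphml(path))  # drop edge directions / parallel edges
    if g.number_of_nodes() <= 100 and nx.is_connected(g):
        features.append(topology_features(g))
```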
We executed 1000 evaluation episodes and computed the average reward achieved by the DRL+GNN agent, the LB policy, and the theoretical fluid routing strategy for each topology. Then, we computed the relative performance (in %) of our agent and the LB policy with respect to the theoretical fluid model. Fig. 8(b) shows the results, where, for readability, we sort the topologies according to the difference in score between the DRL+GNN agent and the LB policy. On the left side of the figure we observe some topology samples where the scores of all three routing strategies coincide. This behavior is expected in topologies that offer few alternative paths for each input traffic demand (e.g., ring or star topologies). As the number of available paths increases, routing optimization becomes necessary to maximize the number of traffic demands allocated.
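The relative performance values reported here (and in Tables 3 and 4) can be computed as in the sketch below; averaging the per-episode ratios is our assumption about the exact aggregation.

```python
import numpy as np

def relative_performance(agent_rewards, fluid_rewards) -> float:
    """Mean relative score (in %) of an agent with respect to the theoretical
    fluid model over a set of evaluation episodes."""
    agent = np.asarray(agent_rewards, dtype=float)
    fluid = np.asarray(fluid_rewards, dtype=float)
    return float(np.mean((agent - fluid) / fluid) * 100.0)

# Example: a positive value means the agent allocates more bandwidth than the
# fluid baseline on average.
print(relative_performance([105.0, 98.0], [100.0, 100.0]))  # -> 1.5
```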
We also trained a DRL+GNN agent only on the Geant2 topology. The mean relative score (with respect to the theoretical fluid model) obtained when evaluating this model on all real-world topologies was +4.78%. In the interest of space, we omit this figure. These results indicate that our DRL+GNN architecture generalizes well to topologies never seen during training, independently of the topology used for training.
These experiments show that our architecture operates robustly in real-world topologies that largely differ from the scenarios seen during training. Even when trained on a single 14-node topology, the agent achieves good performance in topologies of up to 100 nodes.

6.2. Computation time

In this section we analyze the computation time of an already trained DRL+GNN agent when deployed in a realistic scenario. For this purpose, we used the synthetic topologies generated in Section 6.1.1, executed 1000 episodes for each one, and measured the computation time. This is the time the agent takes to select the best path to allocate all the incoming traffic requests.
Fig. 8. DRL+GNN relative performance with respect to the fluid model over 180 synthetic topologies (a) and 232 real-world topologies (b).
Fig. 9. DRL+GNN average computation time (in milliseconds) over different topology sizes.
For this experiment we used off-the-shelf hardware without any specific hardware accelerator (64-bit Ubuntu 16.04 LTS with an Intel Core i5-8400 processor with 2.80 GHz × 6 cores and 8 GB of RAM). The results should be understood only as a reference to analyze the scalability properties of our solution; real implementations in a network device would be highly optimized.
Fig. 9 shows the computation time for all episodes. The dots correspond to the average agent operation time over all episodes and the confidence intervals correspond to the 5/95 percentiles. The execution time is in the order of a few milliseconds and grows linearly with the size of the topology. This is expected due to the way the message passing in the GNN has been designed. The results indicate that, in terms of deployment, the proposed DRL+GNN agent has interesting features: it is capable of optimizing unseen networks with good performance, as optimization algorithms do, but in a single step and in tens of milliseconds, as heuristics do.
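A minimal sketch of how the per-episode computation time can be measured is shown below; agent.select_action() and the Gym-style environment interface are hypothetical stand-ins for the trained DRL+GNN policy and the OTN routing environment.

```python
import time

def episode_time_ms(agent, env) -> float:
    """Wall-clock time (in milliseconds) the agent needs to route all the
    traffic requests of one episode."""
    state = env.reset()
    done = False
    start = time.perf_counter()
    while not done:
        action = agent.select_action(state)   # one GNN forward pass per candidate path
        state, _, done, _ = env.step(action)
    return (time.perf_counter() - start) * 1000.0
```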

6.3. Discussion

In this paper we propose a data-driven solution to solve a routing problem in OTNs. This means that our DRL agent learns from data obtained from past interactions with the environment. The main limitation of this approach is that its performance is expected to drop when evaluated on out-of-distribution data. In our scenario, out-of-distribution data is any data related to the network topology, link features, or traffic matrix that is radically different from the data seen during the training process.
The experimental results on synthetic and real-world topologies (Sections 6.1.1 and 6.1.2, respectively) show that the DRL+GNN architecture has performance issues on some topologies. This performance drop is related to network characteristics that diverge from those of the topology used during training.
Table 3
Features for the synthetic network topologies. The values correspond to the mean of all topologies from each topology size. As a reference, the first row corresponds to the Nsfnet topology used during training.

| Topology size | Mean node degree | Var. node degree | Node betweenness | Edge betweenness | DRL+GNN perf. w.r.t. fluid (%) |
| :--- | :--- | :--- | :--- | :--- | :--- |
| Nsfnet (training) | 3 | 0.2857 | 0.0952 | 0.1020 | - |
| 20 Nodes | 2.90 | 0.1050 | 0.1036 | 0.0988 | 4.305 |
| 30 Nodes | 2.93 | 0.0956 | 0.0844 | 0.0764 | -0.649 |
| 40 Nodes | 2.95 | 0.1025 | 0.0704 | 0.0623 | -3.945 |
| 50 Nodes | 2.96 | 0.1104 | 0.0620 | 0.0538 | -6.422 |
| 60 Nodes | 2.97 | 0.1056 | 0.0559 | 0.0476 | -8.103 |
| 70 Nodes | 2.97 | 0.0920 | 0.0522 | 0.0437 | -10.064 |
| 80 Nodes | 2.98 | 0.0956 | 0.0474 | 0.0395 | -11.380 |
| 90 Nodes | 2.98 | 0.1062 | 0.0436 | 0.0361 | -13.610 |
Table 4
Features for the real-world network topologies. The relative performance is the mean of 1000 evaluation episodes. As a reference, the first row corresponds to the Nsfnet topology used during training.

| Topology id | Avg. node degree | Var. node degree | Node betweenness | Edge betweenness | DRL+GNN perf. w.r.t. fluid (%) |
| :--- | :--- | :--- | :--- | :--- | :--- |
| Nsfnet (training) | 3 | 0.29 | 0.0952 | 0.1020 | - |
| 0 | 2.42 | 9.59 | 0.0410 | 0.0484 | -27.357 |
| 1 | 3.51 | 15.17 | 0.0447 | 0.0394 | -23.944 |
| 2 | 2.00 | 41.41 | 0.0180 | 0.0298 | -25.965 |
| ... | ... | ... | ... | ... | ... |
| 229 | 2.31 | 0.98 | 0.1294 | 0.1615 | 19.066 |
| 230 | 2.06 | 1.75 | 0.1140 | 0.1340 | 18.570 |
| 231 | 2.07 | 2.22 | 0.0994 | 0.1244 | 27.430 |
The link features are normalized and the traffic demands always have the same bandwidth values, which excludes them as the source of the performance drop. However, the network topology does change, and this has a direct impact on the performance of the DRL agent.
Table 3 shows different topology metrics for each topology size (in number of nodes). The edge betweenness is computed in the following way: for each edge, we compute the sum over all node pairs of the fraction of shortest paths that pass through that edge, and then take the mean over all edges. The node betweenness is computed analogously. In addition, the DRL+GNN performance with respect to the theoretical fluid model is also shown. These values correspond to the means obtained when evaluating on all network topologies of each topology size (i.e., the means of the results in Fig. 8(a)).
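These metrics can be reproduced with NetworkX as sketched below; the use of NetworkX's normalized betweenness centralities is an assumption about how the values in Tables 3 and 4 were computed.

```python
import networkx as nx
import numpy as np

def betweenness_metrics(g: nx.Graph) -> dict:
    """Mean node and edge betweenness (plus degree statistics) as used in
    Tables 3 and 4."""
    node_bc = nx.betweenness_centrality(g)        # per-node fraction of shortest paths
    edge_bc = nx.edge_betweenness_centrality(g)   # per-edge fraction of shortest paths
    degrees = np.array([d for _, d in g.degree()], dtype=float)
    return {
        "mean_node_degree": float(degrees.mean()),
        "var_node_degree": float(degrees.var()),
        "node_betweenness": float(np.mean(list(node_bc.values()))),
        "edge_betweenness": float(np.mean(list(edge_bc.values()))),
    }
```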
Even though the synthetic topologies were generated to have an average node degree similar to that of Nsfnet, we can see that other metrics diverge as the topologies become larger. Specifically, the node and edge betweenness become smaller, which indicates that the
all-pairs shortest paths are spread more evenly across the topology. In other words, in small topologies the nodes and edges have proportionally more shortest paths crossing them than in larger ones. The network metrics clearly indicate that the more the topologies differ from Nsfnet, the worse the DRL agent performs.
Table 4 shows the same metrics for the real-world topologies. In this case, the performance results correspond to the means of the results in Fig. 8(b). Following a similar reasoning, we can see that the real-world topologies where the DRL+GNN architecture achieves the worst performance are radically different from Nsfnet (i.e., the top-left topologies from Fig. 8(b)). In addition, we visualized the topologies with ids 0, 1 and 2 and observed that they correspond to topologies that have some nodes with very high connectivity (see the variance of the node degree in Table 4). As with the synthetic topologies, we again observe that the more the topologies differ from Nsfnet, the worse the DRL+GNN agent performs.
There are several things that could be done to improve the generalization capabilities for such topologies. A straightforward approach would be to incorporate topologies with different characteristics into the training set. In addition, the DRL+GNN architecture could be improved by carefully tuning traditional Deep Learning techniques (e.g., regularization, dropout). Finally, the work from [37] suggests that aggregating the information of the neighboring links using a combination of the mean, min, max, and sum of the links' states improves generalization, as sketched below. We consider that improving the generalization is outside the scope of this work and we leave it as future work.
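A minimal sketch of such a multi-aggregation step is shown below; it only illustrates the idea from [37] and is not the aggregation used by our architecture.

```python
import numpy as np

def aggregate_neighbor_states(neighbor_states) -> np.ndarray:
    """Combine the hidden states of neighboring links with mean, min, max and sum.

    neighbor_states: array of shape (num_neighbors, hidden_dim).
    Returns a vector of shape (4 * hidden_dim,) that could feed the GNN update.
    """
    s = np.asarray(neighbor_states, dtype=float)
    return np.concatenate([s.mean(axis=0), s.min(axis=0), s.max(axis=0), s.sum(axis=0)])
```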

7. Related work

Network optimization is a well-known and established topic whose fundamental goal is to operate networks efficiently. Most of the works in the literature use traditional methods to optimize a network state, based on Integer Linear Programming (ILP) or Constraint Programming (CP). A relevant work is [1], where high-level goals indicated by the network operator are converted into valid network configurations using constraint programming. The authors of [38] propose a solution based on ILP for multicast routing in OTN. In addition, they propose a heuristic based on genetic algorithms to improve the locally optimal solution and to reduce the computational complexity. Similar works using genetic algorithms are [39,40].
Finding the optimal routing configuration for a given traffic matrix is a fundamental problem, which has been demonstrated to be NP-hard [41,42]. In this context, several DRL-based solutions have been proposed to address routing optimization. In [11] they propose a DRL solution for spectrum assignment using Q-learning and convolutional NNs. Similarly, in [15] they propose a more elaborate representation of the network state to help the DRL agent easily capture the singularities of the network topology. In [43] they propose a scalable method to solve Traffic Engineering problems with DRL in large networks. The work from [44] combines DRL with Linear Programming to minimize the utilization of the most congested link. In [45] they propose and compare different DRL-based algorithms to solve a Traffic Engineering problem in SD-WAN.
However, most of the proposed DRL-based solutions fail to generalize to unseen scenarios. This is because they pre-process the data from the network state and present it in the form of fixed-size matrices (e.g., the adjacency matrix of a network topology). Then, these representations are processed by traditional neural network architectures (e.g., fully connected or convolutional neural networks). These neural architectures are not suited to learn and generalize over data that is inherently structured as a graph. Consequently, state-of-the-art DRL agents perform poorly when evaluated on topologies that were not seen during training.
There have been several attempts to use GNNs in the field of communication networks. In [46] they use GNNs to learn shortest-path routing and max-min routing in a supervised learning approach. In [47] they
combine GNN with DRL to solve a network planning problem. Another relevant work is the one from [48] where they use a distributed setup of DRL agents to solve a Traffic Engineering problem in a decentralized way. The work from [10] proposes to use GNN to predict network metrics and a traditional optimizer to find the routing that minimizes some network metrics (e.g., average delay). Finally, GNNs have been proposed to learn job scheduling policies in a data-center scenario without human intervention [49].

8. Conclusion

In this paper, we presented a DRL architecture based on GNNs that is able to generalize to unseen network topologies. The use of GNNs to model the network environment allows the DRL agent to operate in different networks than those used for training. We believe that the lack of generalization was the main obstacle preventing the use and deployment of DRL in production networks. The proposed architecture represents a first step towards the development of a new generation of DRL-based products for networking.
In order to show the generalization capabilities of our DRL+GNN solution, we selected a classic problem in the field of optical networks. This served as a baseline benchmark to validate the generalization performance of our architecture. Our results show that the proposed DRL+GNN agent is able to effectively operate in networks never seen during training. Previous DRL solutions based on traditional neural network architectures were not able to generalize to other topologies.
A fundamental challenge that remains to be addressed towards the deployment of DRL techniques for self-driving networks is their black-box nature. DRL does not provide guaranteed performance for all network scenarios and its operation cannot be understood easily by humans. As a result, DRL-based solutions are inherently complex to troubleshoot and debug by network operators. In contrast, computer networks have been built around well-understood analytical and heuristic techniques, and such mechanisms are based on well-known assumptions that perform reasonably well across different scenarios. Such issues are not unique to self-driving networks, but rather common to the application of machine learning to many critical use-cases, such as self-driving cars.

CRediT authorship contribution statement

Paul Almasan: Methodology, Investigation, Writing - review & editing, Conceptualization. José Suárez-Varela: Conceptualization. Krzysztof Rusek: Conceptualization. Pere Barlet-Ros: Writing - review & editing, Supervision, Conceptualization. Albert Cabellos-Aparicio: Writing - review & editing, Supervision, Conceptualization.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

This publication is part of the Spanish I+D+i project TRAINER-A (ref. PID2020-118011GB-C21), funded by MCIN/AEI/10.13039/501100011033. This work is also partially funded by the Catalan Institution for Research and Advanced Studies (ICREA) and the Secretariat for Universities and Research of the Ministry of Business and Knowledge of the Government of Catalonia and the European Social Fund. This work was also supported by the Polish Ministry of Science and Higher Education with the subvention funds of the Faculty of Computer Science, Electronics and Telecommunications of AGH University and by the PL-Grid Infrastructure.

References

[1] R. Hartert, S. Vissicchio, P. Schaus, O. Bonaventure, C. Filsfils, T. Telkamp, P. Francois, A declarative and expressive approach to control forwarding paths in carrier-grade networks, ACM SIGCOMM Comput. Commun. Rev. 45 (4) (2015) 15-28.
[2] S. Knight, H.X. Nguyen, N. Falkner, R. Bowden, M. Roughan, The internet topology zoo, IEEE J. Sel. Areas Commun. 29 (9) (2011) 1765-1775.
[3] V. Mnih, K. Kavukcuoglu, D. Silver, A.A. Rusu, J. Veness, M.G. Bellemare, A. Graves, M. Riedmiller, A.K. Fidjeland, G. Ostrovski, et al., Human-level control through deep reinforcement learning, Nature 518 (2015) 529-533.
[4] D. Silver, T. Hubert, J. Schrittwieser, I. Antonoglou, M. Lai, A. Guez, M. Lanctot, L. Sifre, D. Kumaran, T. Graepel, et al., Mastering chess and shogi by self-play with a general reinforcement learning algorithm, 2017, arXiv preprint arXiv:1712.01815.
[5] N. Feamster, J. Rexford, Why (and how) networks should run themselves, 2017, arXiv preprint arXiv:1710.11583.
[6] M. Wang, Y. Cui, X. Wang, S. Xiao, J. Jiang, Machine learning for networking: Workflow, advances and opportunities, IEEE Netw. 32 (2) (2017) 92-99.
[7] A. Mestres, A. Rodriguez-Natal, J. Carner, P. Barlet-Ros, E. Alarcón, M. Solé, V. Muntés-Mulero, D. Meyer, S. Barkai, M.J. Hibbett, et al., Knowledge-defined networking, ACM SIGCOMM Comput. Commun. Rev. 47 (3) (2017) 2-10.
[8] P. Kalmbach, J. Zerwas, P. Babarczi, A. Blenk, W. Kellerer, S. Schmid, Empowering self-driving networks, in: Proceedings of the Afternoon Workshop on Self-Driving Networks, 2018, pp. 8-14.
[9] A. Valadarsky, M. Schapira, D. Shahaf, A. Tamar, Learning to route, in: Proceedings of the ACM Workshop on Hot Topics in Networks, HotNets, 2017, pp. 185-191.
[10] K. Rusek, J. Suárez-Varela, P. Almasan, P. Barlet-Ros, A. Cabellos-Aparicio, RouteNet: Leveraging graph neural networks for network modeling and optimization in SDN, IEEE J. Sel. Areas Commun. 38 (10) (2020) 2260-2270.
[11] X. Chen, J. Guo, Z. Zhu, R. Proietti, A. Castro, S.J.B. Yoo, Deep-RMSA: A deep-reinforcement-learning routing, modulation and spectrum assignment agent for elastic optical networks, in: Proceedings of the Optical Fiber Communications Conference, OFC, 2018.
[12] Z. Xu, J. Tang, J. Meng, W. Zhang, Y. Wang, C.H. Liu, D. Yang, Experience-driven networking: A deep reinforcement learning based approach, in: IEEE Conference on Computer Communications, INFOCOM, 2018, pp. 1871-1879.
[13] L. Chen, J. Lingys, K. Chen, F. Liu, Auto: Scaling deep reinforcement learning for datacenter-scale automatic traffic optimization, in: Proceedings of the 2018 Conference of the ACM Special Interest Group on Data Communication, 2018, pp. 191-205.
[14] A. Mestres, E. Alarcón, Y. Ji, A. Cabellos-Aparicio, Understanding the modeling of computer network delays using neural networks, in: Proceedings of the ACM SIGCOMM Workshop on Big Data Analytics and Machine Learning for Data Communication Networks, Big-DAMA, 2018, pp. 46-52.
[15] J. Suárez-Varela, A. Mestres, J. Yu, L. Kuang, H. Feng, A. Cabellos-Aparicio, P. Barlet-Ros, Routing in optical transport networks with deep reinforcement learning, IEEE/OSA J. Opt. Commun. Networking 11 (11) (2019) 547-558.
[16] P.W. Battaglia, J.B. Hamrick, V. Bapst, A. Sanchez-Gonzalez, V. Zambaldi, M. Malinowski, A. Tacchetti, D. Raposo, A. Santoro, R. Faulkner, et al., Relational inductive biases, deep learning, and graph networks, 2018, arXiv preprint arXiv:1806.01261.
[17] F. Scarselli, M. Gori, A.C. Tsoi, M. Hagenbuchner, G. Monfardini, The graph neural network model, IEEE Trans. Neural Netw. 20 (1) (2008) 61-80.
[18] J. Gilmer, S.S. Schoenholz, P.F. Riley, O. Vinyals, G.E. Dahl, Neural message passing for quantum chemistry, in: Proceedings of the International Conference on Machine Learning - Vol. 70, ICML, 2017, pp. 1263-1272.
[19] https://github.com/knowledgedefinednetworking/DRL-GNN.
[20] Y. Li, D. Tarlow, M. Brockschmidt, R. Zemel, Gated graph sequence neural networks, 2015, arXiv preprint arXiv:1511.05493.
[21] P. Veličković, G. Cucurull, A. Casanova, A. Romero, P. Lio, Y. Bengio, Graph attention networks, 2017, arXiv preprint arXiv:1710.10903.
[22] P.W. Battaglia, R. Pascanu, M. Lai, D.J. Rezende, et al., Interaction networks for learning about objects, relations and physics, in: Proceedings of Advances in Neural Information Processing Systems, NIPS, 2016, pp. 4502-4510.
[23] C.J.C.H. Watkins, P. Dayan, Q-learning, Mach. Learn. 8 (3-4) (1992) 279-292.
[24] V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, M. Riedmiller, Playing Atari with deep reinforcement learning, 2013, arXiv preprint arXiv:1312.5602.
[25] J. Kuri, N. Puech, M. Gagnaire, Diverse routing of scheduled lightpath demands in an optical transport network, in: Proceedings of the IEEE International Workshop on Design of Reliable Communication Networks, DRCN, 2003, pp. 69-76.
[26] ITU-T Recommendation G.709/Y.1331: Interface for the optical transport network, 2019, https://www.itu.int/rec/T-REC-G.709/.
[27] R.S. Sutton, A.G. Barto, Reinforcement Learning: An Introduction, MIT Press, 2018.
[28] K. Cho, B. Van Merriënboer, D. Bahdanau, Y. Bengio, On the properties of neural machine translation: Encoder-decoder approaches, 2014, arXiv preprint arXiv:1409.1259.
[29] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard, et al., TensorFlow: A system for large-scale machine learning, in: Proceedings of the 12th USENIX Symposium on Operating Systems Design and Implementation, OSDI, 2016, pp. 265-283.
[30] G. Brockman, V. Cheung, L. Pettersson, J. Schneider, J. Schulman, J. Tang, W. Zaremba, OpenAI Gym, 2016, arXiv preprint arXiv:1606.01540.
[31] L. Bottou, Large-scale machine learning with stochastic gradient descent, in: Proceedings of the International Conference on Computational Statistics, COMPSTAT, 2010, pp. 177-186.
[32] X. Hei, J. Zhang, B. Bensaou, C.-C. Cheung, Wavelength converter placement in least-load-routing-based optical networks using genetic algorithms, J. Opt. Netw. 3 (5) (2004) 363-378.
[33] F. Barreto, E.C. Wille, L. Nacamura Jr., Fast emergency paths schema to overcome transient link failures in OSPF routing, 2012, arXiv preprint arXiv:1204.2465.
[34] P. Francois, C. Filsfils, J. Evans, O. Bonaventure, Achieving sub-second IGP convergence in large IP networks, ACM SIGCOMM Comput. Commun. Rev. 35 (3) (2005) 35-44.
[35] S. Jain, A. Kumar, S. Mandal, J. Ong, L. Poutievski, A. Singh, S. Venkata, J. Wanderer, J. Zhou, M. Zhu, et al., B4: Experience with a globally-deployed software defined WAN, ACM SIGCOMM Comput. Commun. Rev. 43 (4) (2013) 3-14.
[36] A. Hagberg, P. Swart, D.S. Chult, Exploring Network Structure, Dynamics, and Function Using NetworkX, Tech. Rep., Los Alamos National Lab. (LANL), Los Alamos, NM (United States), 2008.
[37] K. Xu, W. Hu, J. Leskovec, S. Jegelka, How powerful are graph neural networks? 2018, arXiv preprint arXiv:1810.00826.
[38] L. Gong, X. Zhou, X. Liu, W. Zhao, W. Lu, Z. Zhu, Efficient resource allocation for all-optical multicasting over spectrum-sliced elastic optical networks, IEEE/OSA J. Opt. Commun. Networking 5 (8) (2013) 836-847.
[39] L. Gong, X. Zhou, W. Lu, Z. Zhu, A two-population based evolutionary approach for optimizing routing, modulation and spectrum assignments (RMSA) in O-OFDM networks, IEEE Commun. Lett. 16 (9) (2012) 1520-1523.
[40] M. Klinkowski, M. Ruiz, L. Velasco, D. Careglio, V. Lopez, J. Comellas, Elastic spectrum allocation for time-varying traffic in flexgrid optical networks, IEEE J. Sel. Areas Commun. 31 (1) (2012) 26-38.
[41] Y. Wang, X. Cao, Y. Pan, A study of the routing and spectrum allocation in spectrum-sliced elastic optical path networks, in: IEEE International Conference on Computer Communications, INFOCOM, 2011, pp. 1503-1511.
[42] K. Christodoulopoulos, I. Tomkos, E.A. Varvarigos, Elastic bandwidth allocation in flexible OFDM-based optical networks, J. Lightwave Technol. 29 (9) (2011) 1354-1366.
[43] P. Sun, Z. Guo, J. Lan, J. Li, Y. Hu, T. Baker, ScaleDRL: A scalable deep reinforcement learning approach for traffic engineering in SDN with pinning control, Comput. Netw. 190 (2021) 107891.
[44] J. Zhang, M. Ye, Z. Guo, C.-Y. Yen, H.J. Chao, CFR-RL: Traffic engineering with reinforcement learning in SDN, IEEE J. Sel. Areas Commun. 38 (10) (2020) 2249-2259, http://dx.doi.org/10.1109/JSAC.2020.3000371.
[45] S. Troia, F. Sapienza, L. Varé, G. Maier, On deep reinforcement learning for traffic engineering in SD-WAN, IEEE J. Sel. Areas Commun. 39 (7) (2020) 2198-2212.
[46] F. Geyer, G. Carle, Learning and generating distributed routing protocols using graph-based deep learning, in: Proceedings of the ACM SIGCOMM Workshop on Big Data Analytics and Machine Learning for Data Communication Networks, Big-DAMA, 2018, pp. 40-45.
[47] H. Zhu, V. Gupta, S.S. Ahuja, Y. Tian, Y. Zhang, X. Jin, Network planning with deep reinforcement learning, in: Proceedings of the ACM SIGCOMM 2021 Conference, 2021, pp. 258-271.
[48] G. Bernárdez, J. Suárez-Varela, A. López, B. Wu, S. Xiao, X. Cheng, P. Barlet-Ros, A. Cabellos-Aparicio, Is machine learning ready for traffic engineering optimization? in: 2021 IEEE 29th International Conference on Network Protocols, ICNP, IEEE, 2021, pp. 1-11.
[49] H. Mao, M. Schwarzkopf, S.B. Venkatakrishnan, Z. Meng, M. Alizadeh, Learning scheduling algorithms for data processing clusters, in: Proceedings of ACM SIGCOMM, 2019, pp. 270-288.

Corresponding author.
E-mail addresses: felician.paul.almasan@upc.edu (P. Almasan), jose.suarez-varela@upc.edu (J. Suárez-Varela), krusek@agh.edu.pl (K. Rusek), pere.barlet@upc.edu (P. Barlet-Ros), alberto.cabellos@upc.edu (A. Cabellos-Aparicio).
    https://doi.org/10.1016/j.comcom.2022.09.029
    Received 3 December 2021; Received in revised form 22 June 2022; Accepted 29 September 2022
    Available online 6 October 2022
    0140-3664/© 2022 Elsevier B.V. All rights reserved.
1 Note that solutions based on transfer learning do not offer this property, as DRL agents need to be re-trained on the network where they finally operate.