
Reinforcement learning in reliability and maintenance optimization: A tutorial

Qin Zhang a, Yu Liu a,b,*, Yisha Xiang c, Tangfan Xiahou a,b

a School of Mechanical and Electrical Engineering, University of Electronic Science and Technology of China, Chengdu, Sichuan 611731, PR China
b Center for System Reliability and Safety, University of Electronic Science and Technology of China, Chengdu, Sichuan 611731, PR China
c Department of Industrial Engineering, University of Houston, Houston, TX 77004, USA

A R T I C L E I N F O

Keywords:
Markov decision process
Reinforcement learning
Reliability optimization
Maintenance optimization

A B S T R A C T

The increasing complexity of engineering systems presents significant challenges in addressing intricate reliability and maintenance optimization problems, and advanced computational techniques have become imperative. Reinforcement learning, with its strong capability of solving complicated sequential decision-making problems under uncertainty, has emerged as a powerful tool for the reliability and maintainability community. This paper offers a step-by-step guideline on the utilization of reinforcement learning algorithms for resolving reliability optimization and maintenance optimization problems. We first introduce the Markov decision process modeling framework for these problems and elucidate the design and implementation of solution algorithms, including dynamic programming, reinforcement learning, and deep reinforcement learning. Case studies, including a pipeline system mission abort optimization and a manufacturing system condition-based maintenance decision-making, are included to demonstrate the utility of reinforcement learning in reliability and maintenance applications. This tutorial will assist researchers in the reliability and maintainability community by summarizing the state-of-the-art reinforcement learning algorithms and providing hands-on implementations for reliability and maintenance optimization problems.

1. Introduction

Engineered systems inevitably experience degradation, leading to unexpected shutdowns and significant adverse consequences [1,2]. Reliability and maintenance optimization, which aims to retain and improve system reliability through intervention efforts, including maintenance, mission abort, and system reconfiguration, is of paramount concern throughout the operation of any system [3-5]. As a system gradually degrades over time, reliability and maintenance optimization is typically a sequential decision-making process, where the optimal decisions have to be made in response to the dynamic changes in the system's state [6,7]. The Markov decision process (MDP) provides an analytic framework for reliability and maintenance optimization, in which the system state degrades over operation and the optimal action for each state is identified [8]. However, solving MDPs typically faces scalability and dimensionality issues. The computational burden associated with traditional algorithms (e.g., dynamic programming) significantly increases as the problem size grows, thus limiting their applicability in engineering instances.

Reinforcement learning (RL), a critical paradigm in the field of machine learning, has emerged as a powerful and effective tool for solving MDPs. RL is an efficient framework that focuses on sequential decision-making problems to determine the optimal actions that maximize cumulative rewards [9]. The concept of RL can be traced back to early works in behavioral psychology and neuroscience regarding how animals learn behaviors through interactions with the environment, inspired by Pavlovian and instrumental conditioning [10]. RL utilizes a trial-and-error learning paradigm, gaining knowledge through interactions with the environment and shaping behavior with rewards and punishments. Its powerful mechanism for learning optimal actions from interaction experiences enables it to operate in a model-free manner and eliminates the need for pre-labeled data. This attribute confers a high degree of adaptability and flexibility across various complicated engineering scenarios. The RL paradigm reflects advances in computational capabilities and algorithmic strategies that have overcome limitations related to scalability and dimensionality in MDPs. Additionally, the advances in deep learning have further enhanced RL, culminating in the transformative paradigm known as deep reinforcement learning (DRL).
DRL leverages the efficient function approximation and generalization capabilities of deep neural networks to handle high-dimensional and continuous state spaces effectively. Furthermore, an additional advantage of integrating with deep neural networks is the capability to process diverse types of input data, including images, text, and audio, thereby extending DRL's applicability to a broader spectrum of engineering scenarios. The strong capabilities of RL and DRL in addressing complicated and large-scale problems have rendered them among the most prevalent tools in the reliability and maintainability community [11]. In this article, we aim to offer a step-by-step tutorial on RL for reliability and maintenance optimization. Starting with the MDP formulation for reliability and maintenance optimization problems, this tutorial elucidates the design and implementation of RL algorithms and provides pointers to several case studies in engineering applications.
The rest of this article is organized as follows. Section 2 provides an overview of RL and DRL. Section 3 presents the application of RL in reliability optimization, and Section 4 introduces its application in maintenance optimization. Section 5 concludes this tutorial.

2. Reinforcement learning and deep reinforcement learning

2.1. Markov decision process

MDP is a classical framework for sequential decision-making problems; it is defined on sequences of states and actions and driven by Markov chains. In the sequential decision process, an action is selected in each state which will impact not just the immediate reward, but also subsequent states and future rewards [12]. The goal of an MDP is to develop a policy that selects the optimal action in each state to maximize the expected accumulated rewards over time. MDPs have widespread applications in engineering scenarios and can be traced back to the pioneering work of Richard Bellman [13]. An MDP can be described by a tuple of five elements $(\mathscr{S}, \mathscr{A}, P, R, \gamma)$, where $\mathscr{S}$ is the state space containing all the states, $\mathscr{A}$ is the action space with a set of feasible actions, $P: \mathscr{S} \times \mathscr{A} \rightarrow [0,1]$ is the transition probability matrix which characterizes the dynamics of the environment, $R$ is the reward function, and $\gamma \in [0,1]$ is the discount factor that is used to discount future rewards and thus balances immediate versus long-term rewards. Note that the state of a system typically represents the specific condition of the environment or system being modeled, and it encapsulates all the relevant information needed to drive the decisions.
At each decision step $t$, a decision maker selects an action $a_{t}$ for the current state $s_{t}$, and the system transitions to a new state $s_{t+1}$ following the transition probability $P\left(s_{t+1} \mid s_{t}, a_{t}\right)$ and obtains reward $r_{t}$. A solution to an MDP is provided in terms of a policy $\pi$, which maps each state to an action to take in this state. A deterministic policy maps states to actions, such that it prescribes which action to take in each state, $\pi: \mathscr{S} \rightarrow \mathscr{A}$. A stochastic policy prescribes the probability of each action in each state, $\pi: \mathscr{S} \times \mathscr{A} \rightarrow [0,1]$, where $\pi\left(a_{t} \mid s_{t}\right)$ is the probability of selecting action $a_{t}$ in state $s_{t}$. The goal of an MDP is to obtain an optimal policy that maximizes the expected accumulated rewards. MDPs can be defined over a finite or infinite horizon. The former is characterized by its episodic nature, encompassing systems with clearly delineated terminal states, thus confining the decision-making process to a predefined temporal boundary. Conversely, the latter, lacking explicit terminal states, represents systems where the decision-making extends indefinitely, thereby accommodating continuous adaptation to an evolving process.
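To make these definitions concrete, the following minimal sketch represents a small finite MDP in Python and simulates one transition under a deterministic policy. The two-state repair example and all numerical values are illustrative assumptions rather than part of the tutorial's case studies.

```python
import numpy as np

# A toy finite MDP: two states (0 = degraded, 1 = healthy) and two actions
# (0 = do nothing, 1 = repair). All numbers below are illustrative assumptions.
n_states, n_actions = 2, 2
gamma = 0.95  # discount factor

# P[s, a, s'] = transition probability from state s to s' under action a
P = np.zeros((n_states, n_actions, n_states))
P[1, 0] = [0.3, 0.7]   # healthy + do nothing: may degrade
P[1, 1] = [0.0, 1.0]   # healthy + repair: stays healthy
P[0, 0] = [1.0, 0.0]   # degraded + do nothing: stays degraded
P[0, 1] = [0.1, 0.9]   # degraded + repair: likely restored

# R[s, a] = expected immediate reward of taking action a in state s
R = np.array([[-5.0, -2.0],   # degraded: production loss vs. repair cost
              [ 1.0, -1.0]])  # healthy: profit vs. unnecessary repair

# A deterministic policy maps each state to an action: repair only when degraded
policy = np.array([1, 0])

# One simulated transition under the policy
rng = np.random.default_rng(0)
s = 1
a = policy[s]
s_next = rng.choice(n_states, p=P[s, a])
print(f"state {s} --action {a}--> state {s_next}, reward {R[s, a]}")
```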
The value function of state $s$ under policy $\pi$ is the expected accumulated rewards when starting in $s$ and following $\pi$, and it is represented as:

$$v_{\pi}(s)=\mathbb{E}\left[\sum_{k=0}^{\infty} \gamma^{k} r_{t+k+1} \mid s_{t}=s\right], \quad \forall s \in \mathscr{S}.$$
Similar to the value function, the action-value function $q_{\pi}(s, a)$ defines the expected accumulated rewards of taking action $a$ in state $s$ and following policy $\pi$ onwards, and it is formulated as:

$$q_{\pi}(s, a)=\mathbb{E}\left[\sum_{k=0}^{\infty} \gamma^{k} r_{t+k+1} \mid s_{t}=s, a_{t}=a\right], \quad \forall s \in \mathscr{S}, \forall a \in \mathscr{A}.$$
The value function and action-value function can be formulated as the following recursive forms:

$$v_{\pi}(s)=\sum_{a \in \mathscr{A}} \pi(a \mid s)\left[r(s, a)+\gamma \sum_{s^{\prime} \in \mathscr{S}} P\left(s^{\prime} \mid s, a\right) v_{\pi}\left(s^{\prime}\right)\right], \quad \forall s \in \mathscr{S},$$
$$q_{\pi}(s, a)=r(s, a)+\gamma \sum_{s^{\prime} \in \mathscr{S}} P\left(s^{\prime} \mid s, a\right) v_{\pi}\left(s^{\prime}\right), \quad \forall s \in \mathscr{S}, \forall a \in \mathscr{A}.$$
The optimal solution to an MDP refers to the policy $\pi^{*}$ that generates the maximal expected accumulated reward over the decision horizon. The optimal value function, denoted as $v^{*}(s)$, gives the expected accumulated reward of starting in state $s$ and following the optimal policy $\pi^{*}$, and it can be estimated by solving the well-known Bellman optimality equation [11]:
$$v^{*}(s)=\max _{a}\left\{r(s, a)+\gamma \sum_{s^{\prime} \in \mathscr{S}} P\left(s^{\prime} \mid s, a\right) v^{*}\left(s^{\prime}\right)\right\}, \quad \forall s \in \mathscr{S}.$$
Similar to the optimal value function, the Bellman optimality equation of the optimal action-value function $q^{*}(s, a)$ can be written as:

$$q^{*}(s, a)=r(s, a)+\gamma \sum_{s^{\prime} \in \mathscr{S}} P\left(s^{\prime} \mid s, a\right) \max _{a^{\prime}}\left\{q^{*}\left(s^{\prime}, a^{\prime}\right)\right\}, \quad \forall s \in \mathscr{S}, \forall a \in \mathscr{A}.$$
Based on the optimal value function $v^{*}(s)$ and action-value function $q^{*}(s, a)$, the optimal policy $\pi^{*}$ can be directly estimated by:

$$\pi^{*}(s)=\underset{a}{\operatorname{argmax}}\left\{r(s, a)+\gamma \sum_{s^{\prime} \in \mathscr{S}} P\left(s^{\prime} \mid s, a\right) v^{*}\left(s^{\prime}\right)\right\}=\underset{a}{\operatorname{argmax}}\, q^{*}(s, a).$$
Dynamic programming (DP) provides an exact solution framework for MDPs, based on the premise that a perfect model of the environment is available. DP methods leverage the Bellman optimality equation, decomposing the value function into immediate rewards and discounted future values, thus facilitating a recursive formulation for estimating the optimal policy $\pi^{*}(s)$. The two most popular methods in DP are policy iteration and value iteration. Policy iteration searches for the optimal policy by two iterative steps: policy evaluation and policy improvement. Policy evaluation iteratively evaluates the value function for all states under policy $\pi$ using the Bellman equation:

$$v_{k+1}(s) \leftarrow \sum_{a \in \mathscr{A}} \pi(a \mid s)\left[r(s, a)+\gamma \sum_{s^{\prime} \in \mathscr{S}} P\left(s^{\prime} \mid s, a\right) v_{k}\left(s^{\prime}\right)\right], \quad \text{for all } s \in \mathscr{S},$$

followed by a policy improvement step where the policy is updated by choosing actions that maximize the expected utility:

$$\pi(s) \leftarrow \underset{a}{\operatorname{argmax}}\left\{r(s, a)+\gamma \sum_{s^{\prime} \in \mathscr{S}} P\left(s^{\prime} \mid s, a\right) v_{\pi}\left(s^{\prime}\right)\right\}.$$
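As a concrete illustration of these two alternating steps, the following minimal sketch runs policy iteration on a hypothetical randomly generated finite MDP; the state and action counts, the transition probabilities, and the rewards are assumptions made purely for illustration.

```python
import numpy as np

# A hypothetical random finite MDP used only to exercise the two updates above.
rng = np.random.default_rng(1)
nS, nA, gamma = 4, 2, 0.9
P = rng.dirichlet(np.ones(nS), size=(nS, nA))    # P[s, a, s'] sums to 1 over s'
R = rng.uniform(-1.0, 1.0, size=(nS, nA))        # r(s, a)

policy = np.zeros(nS, dtype=int)                 # start from an arbitrary policy
while True:
    # Policy evaluation: repeated Bellman sweeps under the current policy
    v = np.zeros(nS)
    for _ in range(500):
        v = np.array([R[s, policy[s]] + gamma * P[s, policy[s]] @ v for s in range(nS)])
    # Policy improvement: act greedily with respect to the evaluated values
    q = R + gamma * np.einsum('sat,t->sa', P, v)
    new_policy = q.argmax(axis=1)
    if np.array_equal(new_policy, policy):       # stable policy => optimal
        break
    policy = new_policy
print("optimal policy:", policy)
```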
Value iteration directly estimates the value function for each state using the Bellman optimality equation until it converges to the optimal value function. After identifying the optimal value function, the optimal policy can be trivially extracted by straightforwardly traversing the states and identifying the actions corresponding to the highest values. It can be written as a simple update form that combines the policy improvement and truncated policy evaluation steps:

$$v_{k+1}(s) \leftarrow \max _{a \in \mathscr{A}}\left[r(s, a)+\gamma \sum_{s^{\prime} \in \mathscr{S}} P\left(s^{\prime} \mid s, a\right) v_{k}\left(s^{\prime}\right)\right], \quad \text{for all } s \in \mathscr{S}.$$
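A corresponding sketch of value iteration on the same kind of hypothetical random MDP is given below; after the values converge, the last line extracts the greedy policy, mirroring the policy-extraction rule stated earlier in this section. All numerical settings are illustrative assumptions.

```python
import numpy as np

# Value iteration on a hypothetical random finite MDP (all numbers illustrative).
rng = np.random.default_rng(1)
nS, nA, gamma, tol = 4, 2, 0.9, 1e-8
P = rng.dirichlet(np.ones(nS), size=(nS, nA))    # P[s, a, s']
R = rng.uniform(-1.0, 1.0, size=(nS, nA))        # r(s, a)

v = np.zeros(nS)
while True:
    q = R + gamma * np.einsum('sat,t->sa', P, v)  # one-step look-ahead values
    v_new = q.max(axis=1)                         # Bellman optimality backup
    if np.max(np.abs(v_new - v)) < tol:           # stop when a full sweep changes little
        v = v_new
        break
    v = v_new

# Greedy policy extraction from the converged value function
policy = (R + gamma * np.einsum('sat,t->sa', P, v)).argmax(axis=1)
print("optimal values:", np.round(v, 3), "optimal policy:", policy)
```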
Although DP methods are well developed mathematically, the foundational assumption of a complete and accurate model of the environment is oftentimes unrealistic in real-world scenarios [12]. This idealized assumption overlooks inherent variances and uncertainties in dynamic environments. Moreover, the computational time of DP methods grows exponentially with respect to the dimensionality of the state and action spaces, leading to a phenomenon termed the curse of dimensionality [14]. The considerable computational burden associated with addressing large-scale MDPs renders classical DP methods impractical for high-dimensional or complex problems in practical applications [15].

2.2. Reinforcement learning

RL provides an efficient paradigm to address large-scale MDPs through continuous interactions between the agent and the environment, as depicted in Fig. 1. The agent selects actions based on the observed states of the environment, with each action resulting in a reward and the next state of the environment. RL is learning what to do, i.e., how to map situations to actions, so as to maximize the reward signals [13]. RL strikes a balance between exploring new opportunities and exploiting known actions through a trial-and-error search guided by a scalar reward signal obtained from experiences and interactions within a specified environment, thereby adapting to a dynamic environment without necessitating a predefined model [9]. This unique learning mechanism eliminates the need for pre-labeled data and sets RL apart from other machine learning methods, demonstrating promising potential for sequential decision-making. RL originated from foundational research in psychology and neuroscience, particularly studies focusing on trial-and-error learning processes in animals. The evolution of RL as a distinct research area was significantly accelerated following Richard Bellman's pioneering works [16], which provided a systematic approach to solving sequential decision-making problems. Monte Carlo methods and temporal-difference learning are two fundamental classes of RL approaches for estimating value functions and discovering optimal policies. Monte Carlo methods estimate the value of a state by simply averaging the sample returns (cumulative rewards) observed after visiting that state. As more returns are observed, the average should converge to the expected value. Monte Carlo methods require only experience, i.e., sample sequences of states, actions, and rewards from actual or simulated interaction with an environment. A simple Monte Carlo method can be formulated as:

$$v\left(s_{t}\right) \leftarrow v\left(s_{t}\right)+\alpha\left[\sum_{k=0}^{\infty} \gamma^{k} r_{t+k+1}-v\left(s_{t}\right)\right],$$
where $\sum_{k=0}^{\infty} \gamma^{k} r_{t+k+1}$ denotes the return starting from time $t$, and $\alpha$ is a constant step-size parameter.
Temporal-difference learning is a combination of the sampling of Monte Carlo methods and the bootstrapping of DP. It updates estimates incrementally after each step, once the reward $r_{t}$ and the next state $s_{t+1}$ are observed. The simplest temporal-difference method possesses the following updating form:
$$v\left(s_{t}\right) \leftarrow v\left(s_{t}\right)+\alpha\left[r_{t}+\gamma\, v\left(s_{t+1}\right)-v\left(s_{t}\right)\right].$$

Temporal-difference learning is more efficient as it can learn before knowing the final outcome and can learn from incomplete sequences of experience; such a learning strategy has been regarded as the most central and novel idea of RL [17].
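The two update rules can be compared side by side on a small example. The following sketch estimates state values for a hypothetical five-state random walk with both the constant step-size Monte Carlo update and the TD(0) update; the chain, step size, and episode count are illustrative assumptions.

```python
import numpy as np

# Constant-alpha Monte Carlo vs. TD(0) value estimation on a hypothetical
# 5-state random walk (non-terminal states 0..4, termination at either end,
# reward 1 only when exiting on the right).
rng = np.random.default_rng(0)
n_states, alpha, gamma = 5, 0.05, 1.0
v_mc, v_td = np.zeros(n_states), np.zeros(n_states)

for _ in range(5000):
    s, trajectory = 2, []                    # every episode starts in the centre state
    while True:
        s_next = s + rng.choice([-1, 1])     # equiprobable random policy
        r = 1.0 if s_next == n_states else 0.0
        trajectory.append((s, r))
        done = s_next in (-1, n_states)
        # TD(0): bootstrap from the current estimate of the next state
        v_next = 0.0 if done else v_td[s_next]
        v_td[s] += alpha * (r + gamma * v_next - v_td[s])
        if done:
            break
        s = s_next
    # Monte Carlo: move every visited state towards its observed return
    G = 0.0
    for s, r in reversed(trajectory):
        G = r + gamma * G                    # return from this state onwards
        v_mc[s] += alpha * (G - v_mc[s])

print("MC estimates :", np.round(v_mc, 2))   # true values are 1/6, 2/6, ..., 5/6
print("TD estimates :", np.round(v_td, 2))
```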
Within the learning processes of RL, the interaction between agent and environment can be either model-based or model-free. MDPs with a transition probability matrix that determines the next state are referred to as model-based RL algorithms, whereas model-free frameworks do not construct an explicit model of the environment. Model-free RL utilizes a simulator that mimics the environment to train the algorithm, rather than assuming a prior understanding of the environmental dynamics, making it particularly pertinent in real-world scenarios. Due to the intricate nature of engineering systems’ dynamics, accurately depicting the environment in many practical scenarios is challenging [18]. Consequently, model-free RL is predominantly employed in different fields. Nevertheless, traditional RL, particularly in its tabular form, struggles with continuous state and action spaces, high-dimensional spaces, and long decision horizons [19].
在强化学习的学习过程中,智能体与环境之间的互动可以是基于模型的或无模型的。具有确定下一个状态的转移概率矩阵的马尔可夫决策过程(MDP)被称为基于模型的强化学习算法,而无模型框架则不构建环境的显式模型。无模型强化学习利用模拟器来模拟环境以训练算法,而不是假设对环境动态的先验理解,这使其在现实世界场景中尤为相关。由于工程系统动态的复杂性,在许多实际场景中准确描绘环境是具有挑战性的。因此,无模型强化学习在不同领域中被广泛应用。然而,传统的强化学习,特别是在其表格形式中,在连续状态和动作空间、高维空间以及长决策时间范围方面面临困难。

2.3. Deep reinforcement learning

With the rise of deep learning, deep neural networks have shown powerful function approximation and representation learning properties, providing a new impetus to the development of RL. DRL, combining deep learning and RL, has emerged as a transformative paradigm in response to the limitations of traditional RL [20]. DRL leverages the robust function approximation capabilities of deep neural networks within RL's learning framework to enhance decision-making [21,22]. At its core, DRL employs deep neural networks to establish a sophisticated, non-linear relationship between sensory inputs and either the values associated with potential actions or the probabilities of taking specific actions [23,24]. This process is enhanced through feedback signals that fine-tune the network's weights to improve value function estimates or increase the occurrence of actions leading to higher rewards. An early milestone of DRL was the development of TD-Gammon in the early 1990s, a system that utilized neural networks and RL to achieve competitive backgammon play against top human players [25]. TD-Gammon was grounded in a temporal-difference RL algorithm, which was pivotal in estimating the win probability for each board position encountered. It adjusted its strategy based on the reward-prediction error, which signaled unexpected gains or losses.
The landscape of DRL experienced a transformative shift in 2013 with the introduction of the Deep Q-Network (DQN), marking the first instance where a DRL system learned to navigate the complexities of classic Atari video games [26,27]. DQN's success relied on its innovative techniques, namely experience replay and target network mechanisms, to mitigate the inherent non-stationarity of RL algorithms, rendering them more analogous to supervised learning challenges and thus more amenable to deep learning methodologies. Since the advent of DQN, DRL research has rapidly expanded into diverse and complex fields such as chess and Go [28], and the strategic games Dota [29] and StarCraft II [30]. DRL algorithms can be classified into model-based and model-free methods.
Fig. 1. The interaction between agent and environment in an MDP.

Model-based algorithms can be further divided into two categories depending on how the model of the environment is obtained: the “learn the model” and “given the model” approaches. In the “given the model” approach, the model of the environment is predefined, and the agent does not need to learn the model from data, allowing for efficient planning and decision-making. This approach is typically utilized in scenarios where the environments are well understood and can be accurately modeled, such as in board games or physical simulations [31]. In the “learn the model” approach, the agent must learn a model of the environment through interactions [32]. This approach predicts the dynamics of the environment, including state transitions and rewards. The process of learning the model typically involves collecting and analyzing a significant amount of data to ensure the model can accurately reflect the behavior of the environment [33]. According to the fundamental structure and learning approach, DRL can be categorized into value-based, policy-based, and actor-critic methods. Value-based algorithms infer the optimal policy through approximate estimation of the action-value function. Conversely, policy-based algorithms directly approximate the optimal policy, thereby converting the learning challenge into a more explicit optimization problem. Actor-critic algorithms combine the strengths of both value-based and policy-based approaches. By parameterizing both the policy (actor) and the value function (critic), these algorithms optimize the utilization of training data while ensuring stability in convergence.
Fig. 2 visualizes some typical structures of neural networks for representing the action-value function, the policy, or an actor-critic combination of policy and value functions. In the architecture for value-based methods, the input state $s$ is processed through multiple hidden layers, and the output layer produces the action-value function $q(s, a ; \theta)$ for each action. Policy-based methods leverage a neural network to approximate the policy $\pi(a \mid s ; \theta)$, which directly maps states to a probability distribution over actions. The network's input is the state, and the output is a SoftMax layer providing the probability of each action. The actor-critic architecture combines elements of both value-based and policy-based methods. It consists of two parts: the actor network and the critic network. The actor network outputs the policy through a SoftMax layer, while the critic network outputs the state value, representing the expected return from a given state. By parameterizing the action-value function and policy with learnable network weights, DRL can iteratively fine-tune the weights $\theta$ to learn the optimal policy based on the gradient of the loss:

$$\theta_{k+1} \leftarrow \theta_{k}-\alpha \nabla_{\theta_{k}} \mathscr{L}\left(\theta_{k}\right),$$
where $\theta_{k}$ denotes the parameters of the neural network at training iteration $k$, $\alpha>0$ is the learning rate, and $\mathscr{L}\left(\theta_{k}\right)$ is the loss function that measures how well the neural network approximates the policy or value functions using observations and the output of the neural network.
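To illustrate how such a value-based network and the gradient update above fit together, the following PyTorch sketch defines a small fully connected Q-network and performs a single update step on a fabricated mini-batch of transitions. The layer sizes, the random data, and the simple temporal-difference target used in the loss are illustrative assumptions rather than a complete DRL algorithm.

```python
import torch
import torch.nn as nn

# A minimal value-based architecture: a fully connected network mapping a state
# vector to one action-value per action. Sizes are illustrative assumptions.
state_dim, n_actions, gamma, lr = 3, 2, 0.95, 1e-3

q_net = nn.Sequential(
    nn.Linear(state_dim, 64), nn.ReLU(),
    nn.Linear(64, 64), nn.ReLU(),
    nn.Linear(64, n_actions),             # outputs q(s, a; theta) for every action
)
optimizer = torch.optim.Adam(q_net.parameters(), lr=lr)

# One gradient step on a fabricated mini-batch of transitions (s, a, r, s', done)
s      = torch.randn(32, state_dim)
a      = torch.randint(0, n_actions, (32, 1))
r      = torch.randn(32, 1)
s_next = torch.randn(32, state_dim)
done   = torch.zeros(32, 1)

with torch.no_grad():                     # bootstrapped temporal-difference target
    target = r + gamma * (1 - done) * q_net(s_next).max(dim=1, keepdim=True).values
q_sa = q_net(s).gather(1, a)              # q-values of the actions actually taken
loss = nn.functional.mse_loss(q_sa, target)

optimizer.zero_grad()
loss.backward()                           # theta_{k+1} <- theta_k - alpha * grad L(theta_k)
optimizer.step()
print("loss:", loss.item())
```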
The taxonomy of RL/DRL algorithms and some representative algorithms is presented in Fig. 3. It is worth noting that classifications in RL/DRL may sometimes overlap, especially between actor-critic and policy-based methods. For example, while PPO (Proximal Policy Optimization) is often categorized as a policy-based method, it typically incorporates an actor-critic architecture in its implementation, where the critic is used to estimate the advantage function and the actor updates the policy based on these estimates.

3. Reinforcement learning in reliability optimization

Ensuring reliability is crucial during the operational phase of engineered systems. However, as systems often operate in dynamic environments with varying conditions, a system typically experiences random degradation profiles that may threaten its reliable operation. Therefore, a series of endeavors have been undertaken to enhance system reliability, such as system reconfiguration, mission abort, and loading optimization [34,35]. MDP provides an analytical framework for reliability optimization in the operation phase. By modeling the sequential decision problems as an MDP, multiple RL algorithms can be implemented to solve for the optimal policies. In the literature, many studies focus only on the components that have a pivotal impact on system reliability, often simplifying the system to a single-component system. The reliability optimization of single-component systems typically involves limited state and action spaces and can be solved exactly by classical algorithms such as value iteration and policy iteration [36]. For the reliability optimization of multi-component systems or a system fleet, the state and action spaces increase exponentially as the number of components grows, thereby rendering the solution of the optimal policy significantly more challenging [37]. To address the curse of dimensionality associated with increasing problem size, DRL techniques have been introduced to identify the optimal policy in a computationally efficient manner.

Fig. 2. The structures of neural networks in DRL.
Fig. 3. Taxonomy of DRL algorithms. (I2A (Imagination-Augmented Agents), MBMF (Model-Based RL with Model-Free Fine-Tuning), PILCO (Probabilistic Inference for Learning Control), SARSA (State-Action-Reward-State-Action), DQN (Deep Q-Network), QR-DQN (Quantile Regression DQN), HER (Hindsight Experience Replay), TRPO (Trust Region Policy Optimization), DDPG (Deep Deterministic Policy Gradient), A2C (Advantage Actor Critic), A3C (Asynchronous Advantage Actor Critic), TD3 (Twin Delayed DDPG), SAC (Soft Actor-Critic)).
Fig. 4. An overview of the existing literature on RL in reliability optimization [14,34,36-42].

Fig. 4 provides a brief review of the literature on reinforcement learning in reliability optimization.
In this section, we take the mission abort problem as an example to illustrate the application of reinforcement learning in reliability optimization, including MDP modeling, RL algorithm design, and application to a gas pipeline system.

3.1. Markov decision process formulation

Developing an MDP model is critical to solving reliability optimization problems in engineering practice. Firstly, it is essential to identify all possible states of the system, which should comprehensively represent the different operational conditions of the system. These states can be determined based on historical performance data and expert knowledge. Secondly, identify the actions that can influence the system's state, such as system reconfiguration, mission abort, and loading distribution. Thirdly, define the probabilities of transitioning from one state to another given a specific action. These probabilities can be estimated using historical mission data, failure data, and other relevant information. Statistical methods and reliability analysis tools can be used to derive accurate transition probabilities. Lastly, it is also crucial to develop a reward function that quantifies the benefits for all state and action pairs. This function typically includes cost elements such as the operation cost, mission profit, and failure penalties, with the goal of maximizing the expected cumulative reward over the planning horizon. In summary, to support the MDP model development, relevant data, including system performance data, historical failure data, operational data, and mission information, should be collected.
Suppose a system is meant to perform a specific mission. The system state degrades over operation, and the degraded (operating) states are denoted as $\{1, \ldots, n\}$, distinguished by their performance capacities; state 0 and state $n$ denote the completely failed state and the perfectly functioning state, respectively. The performance capacity of the system in state $i$ is denoted as $g_{i}$ $(i \in 1, \cdots, n)$, and one has $g_{i}<g_{j}, \forall i<j$. The system undergoes periodic inspection with equal time interval $\Delta t$, and the system state at the beginning of each time step can be precisely observed. The state and performance capacity of the system at time step $t$ are represented as $X_{t}$ and $G_{t}$, respectively. The degradation process of the system is characterized by a homogeneous discrete-time Markov chain with state transition matrix $\mathbf{P}=\left[p_{i, j}\right]_{n \times n}$, where $p_{i, j}$ $(i, j=1, \cdots, n)$ denotes the state transition probability from state $i$ to state $j$. $\mathbf{P}$ is a lower triangular matrix, as the state of the system cannot be restored. The state transition matrix can be evaluated from the system's historical observation data [43]. At each time step $t$, the decision maker has the option to either abort the mission or continue with it. The costs of mission failure and system failure (state 0) are denoted as $c_{\mathrm{m}}$ and $c_{\mathrm{f}}$, respectively. If the selected action is aborting the mission, denoted as $a_{0}$, then a mission failure cost $c_{\mathrm{m}}$ is incurred. If the selected action is continuing the mission, denoted as $a_{1}$, the system will continue to execute the mission till the next inspection, and an operation cost $c_{\mathrm{op}}$ is incurred. If the system fails during the mission, then the mission failure cost and system failure cost $c_{\mathrm{m}}+c_{\mathrm{f}}$ are incurred. The success of the mission depends on the system cumulative performance and demand. The cumulative performance of the system at time step $t$ is denoted as $\varphi_{t}$, and one has $\varphi_{t}=\sum_{\tau=1}^{t} G_{\tau} \Delta t$. If the cumulative performance of the system exceeds the pre-specified demand $D$ within the mission duration $T$, i.e., $\varphi_{t} \geq D, \exists t \in[1, T]$, the mission succeeds and a profit $r_{\mathrm{m}}$ is yielded. Conversely, if the system is in an operating state throughout the mission duration $T$ but the cumulative performance is less than the demand, i.e., $\varphi_{T}<D$, a mission failure cost $c_{\mathrm{m}}$ will be incurred. Owing to the stochastic nature of system degradation, the mission abort problem is a dynamic optimization problem, where the actions have to be dynamically identified according to the system's degraded state and mission progress to maximize the expected profit.
The dynamic mission abort problem is formulated as a finite horizon MDP, and the state space, action space, reward function, and Bellman optimality equation are presented as follows.
State space: The state is a representation of the environment accessible to the agent at each decision step; the states constantly change based on the interaction of the agent with the environment and on the nature of the problem. In the dynamic mission abort problem, the mission abort decision takes into account not only the degraded state of the system, but also the execution progress of the mission. The worse the state of the system, the lower the cumulative performance of the system, and the less time remaining in the mission, the more the decision maker is inclined to abort the mission to minimize the loss in time. The state space, therefore, consists of the system degraded state, the system cumulative performance, and the remaining mission time, i.e., $\mathbf{s}_{t} \in\left\{\left(X_{t}, G_{t}, \delta_{t}\right) \mid 0<X_{t} \leq n,\; 0 \leq G_{t}<D,\; 0 \leq \delta_{t}<T\right\} \cup\{\phi\}$, where $\phi$ is the terminal state, which covers four cases: 1) the system fails during the mission; 2) the mission succeeds, i.e., the system cumulative performance exceeds the demand; 3) the time in the mission reaches the maximum allowable mission time; and 4) the mission is aborted.
Action space: During mission execution, the decision maker has the option to either abort the mission or continue with it at each time step $t$. The action space, therefore, can be denoted as $\left\{a_{0}, a_{1}\right\}$, where $a_{0}$ and $a_{1}$ represent the actions to abort and continue the mission, respectively.
Based on the state and the selected action at time step $t$, the state transition for the next time interval $\Delta t$ can be described as follows. If action $a_{0}$ is selected, the mission will be aborted immediately, and the MDP transits to the terminal state with a mission failure cost. If action $a_{1}$ is selected, the state transition probability can be evaluated by:

$$P\left(\mathbf{s}_{t+1} \mid \mathbf{s}_{t}, a_{t}=a_{1}\right)=p_{X_{t}, X_{t+1}} \cdot I\left(G_{t+1}=G_{t}+g_{X_{t}}\right) \cdot I\left(\delta_{t+1}=\delta_{t}-\Delta t\right),$$

where $I(\cdot)$ is an indicator function that equals one if condition $(\cdot)$ is true and zero otherwise.
Reward: To maximize the expected total profit of the system over the planning horizon, the reward function of the mission abort problem is designed to reflect cost and mission profit. The reward, serving as the evaluative feedback of a selected action, is calculated based on the profit or cost incurred between two consecutive decisions and can be formulated as follows:
$$r_{t}\left(\mathbf{s}_{t}, a_{t}, \mathbf{s}_{t+1}\right)=\begin{cases}-c_{\mathrm{m}} & a_{t}=a_{0} \\ r_{\mathrm{m}}-c_{\mathrm{op}} & a_{t}=a_{1}, G_{t+1} \geq D \\ -c_{\mathrm{op}}-c_{\mathrm{m}}-c_{\mathrm{f}} & a_{t}=a_{1}, G_{t+1}<D, X_{t+1}=0 \\ -c_{\mathrm{op}}-c_{\mathrm{m}} & a_{t}=a_{1}, G_{t+1}<D, X_{t+1}>0, t=T-1 \\ -c_{\mathrm{op}} & \text{otherwise}\end{cases}$$
It is noted that the reward function is related not only to the current state and the selected action, but also to the system's degraded state and mission progress at the next time step. If the system fails at the next time step, the mission failure cost and system failure cost are incurred, and if the mission succeeds at the next time step, a profit is yielded.

Algorithm 1: Value iteration
1: Initialize $v(\mathbf{s}_{t})$ for all $\mathbf{s}_{t} \in \mathscr{S}$
2: While $\Delta>\xi$ do
3:   For each state $\mathbf{s}_{t} \in \mathscr{S}$ do
4:     $v \leftarrow v(\mathbf{s}_{t})$
5:     $v(\mathbf{s}_{t}) \leftarrow \max_{a_{t} \in \mathscr{A}} \sum_{\mathbf{s}_{t+1}} P(\mathbf{s}_{t+1} \mid \mathbf{s}_{t}, a_{t})\left[r(\mathbf{s}_{t}, a_{t}, \mathbf{s}_{t+1})+v(\mathbf{s}_{t+1})\right]$
6:     $\Delta \leftarrow \max\{\Delta,\,|v-v(\mathbf{s}_{t})|\}$
7:   End For
8: End While

Fig. 5. The pseudo-code of the value iteration algorithm.
With the aim of maximizing the expected total profit, the value functions $v^{*}\left(\mathbf{s}_{t}\right)$ can be evaluated by solving the Bellman optimality equations:
$$v^{*}\left(\mathbf{s}_{t}\right)=\max _{a_{t} \in\left\{a_{0}, a_{1}\right\}}\left\{\sum_{\mathbf{s}_{t+1}} P\left(\mathbf{s}_{t+1} \mid \mathbf{s}_{t}, a_{t}\right) \cdot\left[r_{t}\left(\mathbf{s}_{t}, a_{t}, \mathbf{s}_{t+1}\right)+v^{*}\left(\mathbf{s}_{t+1}\right)\right]\right\}
=\max _{a_{t} \in\left\{a_{0}, a_{1}\right\}} \begin{cases}-c_{\mathrm{m}} & a_{t}=a_{0} \\ \sum_{\mathbf{s}_{t+1}} P\left(\mathbf{s}_{t+1} \mid \mathbf{s}_{t}, a_{t}=a_{1}\right) \cdot\left[r_{t}\left(\mathbf{s}_{t}, a_{t}, \mathbf{s}_{t+1}\right)+v^{*}\left(\mathbf{s}_{t+1}\right)\right] & a_{t}=a_{1}\end{cases}$$
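Before turning to the solution algorithms, the following sketch wraps the state space, action space, transition probabilities, and reward function defined above into a small simulation environment suitable for the model-free methods of Section 3.2. The class name MissionAbortEnv, the three operating states, and every parameter value are illustrative assumptions and do not correspond to the gas pipeline case study.

```python
import numpy as np

class MissionAbortEnv:
    """A hypothetical simulator of the mission abort MDP formulated above.
    States are tuples (X_t, cumulative performance, remaining time);
    actions are 0 = abort (a_0) and 1 = continue (a_1)."""

    def __init__(self, seed=0):
        # Illustrative parameters: failed state 0 plus three operating states,
        # a lower-triangular transition matrix, capacities g_i, demand D,
        # mission duration T, and the cost/profit terms of the reward function.
        self.P = np.array([[1.00, 0.00, 0.00, 0.00],
                           [0.30, 0.70, 0.00, 0.00],
                           [0.10, 0.20, 0.70, 0.00],
                           [0.05, 0.10, 0.15, 0.70]])
        self.g = np.array([0.0, 1.0, 2.0, 3.0])
        self.D, self.T = 20.0, 10
        self.c_op, self.c_m, self.c_f, self.r_m = 1.0, 10.0, 20.0, 30.0
        self.rng = np.random.default_rng(seed)

    def reset(self):
        self.t, self.X, self.G = 0, 3, 0.0     # as-good-as-new, no accumulated output
        return (self.X, self.G, self.T - self.t)

    def step(self, action):
        if action == 0:                        # a_0: abort -> terminal state, cost c_m
            return None, -self.c_m, True
        self.G += self.g[self.X]               # accumulate the current state's capacity
        self.X = int(self.rng.choice(4, p=self.P[self.X]))   # degrade according to P
        self.t += 1
        reward, done = -self.c_op, False       # operating cost for the elapsed interval
        if self.G >= self.D:                   # mission success: demand satisfied
            reward += self.r_m; done = True
        elif self.X == 0:                      # system failure during the mission
            reward += -self.c_m - self.c_f; done = True
        elif self.t == self.T:                 # horizon reached without meeting demand
            reward += -self.c_m; done = True
        return (self.X, self.G, self.T - self.t), reward, done
```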

3.2. Solution algorithms

Section 3.1 provided a universal formulation for mission abort problems. In this section, the value iteration, Q-learning, and DQN algorithms are used as examples to illustrate how to utilize DP, RL, and DRL approaches, respectively, to solve the dynamic mission abort problem. These algorithms are implemented in a Python 3.9 environment using the PyTorch package, and the complete code is provided in the supplementary materials.
Value iteration: Value iteration is a classic DP algorithm to find the optimal policy by iteratively estimating the optimal value function. In dynamic mission abort problems, the value function denotes as the expected total profit when starting in state s t s t s_(t)s_{t}. The value iteration algorithm can be expressed as the following iteration pattern:
值迭代:值迭代是一种经典的动态规划算法,通过迭代估计最优价值函数来寻找最优策略。在动态任务中止问题中,价值函数表示为在状态 s t s t s_(t)s_{t} 下开始时的预期总利润。值迭代算法可以表示为以下迭代模式:

Q-learning: Q-learning is a model-free algorithm that estimates the optimal action-value function without a model of the environment; it learns from the experience gained through interactions with the environment. The Q-learning algorithm typically utilizes the $\varepsilon$-greedy strategy for exploration, where the agent performs a random action with probability $\varepsilon\in[0,1]$ and performs the return-maximizing action otherwise. Typically, a larger value needs to be set in the early iterations of the algorithm to fully explore the action space. In this algorithm, $\varepsilon$ is generated by progressively decreasing it from the maximum value $\varepsilon_{\max}$ in steps of $\Delta\varepsilon$ until the minimum value $\varepsilon_{\min}$ is reached. The agent selects actions based on the observed states of the environment, and each action results in a reward as well as the next state of the environment.

In the dynamic mission abort problem, the action-value function $Q(\mathbf{s}_{t},a_{t})$ is the expected total profit of taking the action to abort or continue the mission in each state $\mathbf{s}_{t}$. The Q-learning algorithm gradually learns the optimal mission abort policy from the interactions with the simulation environment. Once $\mathbf{s}_{t+1}$ and $r_{t}$ are obtained after executing an action $a_{t}$ at one iteration, Q-learning immediately updates the action-value function by:

$$
Q(\mathbf{s}_{t},a_{t}) \leftarrow Q(\mathbf{s}_{t},a_{t})+\alpha\left[r_{t}+\max_{a\in\{a_{0},a_{1}\}}Q(\mathbf{s}_{t+1},a)-Q(\mathbf{s}_{t},a_{t})\right].
$$

The pseudo-code of the Q-learning algorithm is shown in Fig. 6. Line 1 is responsible for initializing the action-value function. Lines 2-9 detail the process for updating the action-value function based on interactions with the environment, and the $M$ in line 2 denotes the iteration number of the Q-learning algorithm. The code for implementing the Q-learning algorithm can be found in Fig. A.2 in Appendix A of the supplementary materials.
Algorithm 2: Q-learning
    Initialize \(Q\left(\mathbf{s}_{t}, a_{t}\right)\) for \(\mathbf{s}_{t} \in \mathrm{~S}, a_{t} \in \mathrm{~A}\) arbitrarily (e.g., to all zeros)
    For episode \(\leftarrow 1\) to \(M\) do
        Initialize state \(s_{t}\)
        For \(t \leftarrow 1\) to \(T\) do
            Select action \(a_{t}\) using \(\varepsilon\)-greedy policy based on \(Q\left(\mathbf{s}_{t}, a_{t}\right)\)
            Execute action \(a_{t}\) and observe reward \(r_{t}\) and \(\mathbf{s}_{t+1}\)
            Update \(Q\left(\mathbf{s}_{t}, a_{t}\right) \leftarrow Q\left(\mathbf{s}_{t}, a_{t}\right)+\alpha\left[r_{t}+\max _{a \in A} Q\left(\mathbf{s}_{t+1}, a\right)-Q\left(\mathbf{s}_{t}, a_{t}\right)\right]\)
        End For
    End For
Fig. 6. The pseudo-code of the Q-learning algorithm.
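As a complement to the pseudo-code in Fig. 6, the following is a minimal tabular sketch of the update equation above with a decaying $\varepsilon$-greedy policy. The environment object `env`, its `reset()`/`step()` interface, and the hyper-parameter values are assumptions for illustration; the complete implementation is provided in Fig. A.2 of the supplementary materials.

```python
# A minimal tabular Q-learning sketch with a decaying epsilon-greedy policy.
# `env` is assumed to expose reset() -> state and step(a) -> (next_state, reward, done)
# with hashable states and actions {0: abort, 1: continue}; all hyper-parameters
# are illustrative.
import random
from collections import defaultdict

def q_learning(env, episodes=10_000, alpha=0.1,
               eps_max=1.0, eps_min=0.01, eps_delta=0.005):
    Q = defaultdict(lambda: [0.0, 0.0])            # Q(s, a) initialized to zero
    eps = eps_max
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            if random.random() < eps:              # explore with probability eps
                a = random.randrange(2)
            else:                                  # otherwise act greedily
                a = max((0, 1), key=lambda act: Q[s][act])
            s_next, r, done = env.step(a)
            target = r if done else r + max(Q[s_next])
            Q[s][a] += alpha * (target - Q[s][a])  # temporal-difference update
            s = s_next
        eps = max(eps_min, eps - eps_delta)        # decay the exploration rate
    return Q
```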

Dueling deep Q-network: DQN is a pioneering contribution that incorporates neural networks into RL to overcome the curse of dimensionality [27]. DQN leverages a neural network as a function approximator to estimate the action-value function, allowing it to be applicable to MDPs with continuous and high-dimensional state spaces. In the DQN algorithm for dynamic mission abort problems, the action-value function, i.e., the expected total profit of taking an action $a_{t}\in\{a_{0},a_{1}\}$ in system state $\mathbf{s}_{t}=\{X_{t},G_{t},\delta_{t}\}$, is represented by $Q(\mathbf{s}_{t},a_{t};\theta)$ parameterized by $\theta$, and it is trained from the experience generated from the interaction between the agent and the environment. The input of the Q-network consists of the system state $X_{t}$, the system's cumulative performance $G_{t}$, and the remaining mission time $\delta_{t}$, and the output is the action-value of taking the action to abort or continue the mission.
In the DQN algorithm, the target network and experience replay techniques are developed to ensure stability during the training process. The experience replay technique is employed to store the transition $(\mathbf{s}_{t},a_{t},r_{t},\mathbf{s}_{t+1})$ of each interaction in a memory buffer $D$. When it comes time to update the network, a batch of experiences is randomly sampled from the replay buffer, instead of using only the most recent experiences. The random sampling mechanism can reduce correlations between data, enhancing the stability and efficiency of learning. The target network $Q(\mathbf{s}_{t},a_{t};\theta^{-})$ is introduced to overcome the instabilities during training that arise because the updates to the Q-network involve its own predictions. The target network is a clone of the Q-network, but its parameters are updated more slowly. When updating the Q-network, the target network is used to estimate the value of the action that maximizes future rewards. This strategy employs an older parameter set to generate targets, providing a delay between the update to the Q-network and the time it impacts the targets. This mechanism can reduce fluctuations and instabilities during the learning process. Through the target network and experience replay techniques, the parameters of the value network can then be updated to minimize the loss function of a minibatch of transitions:

$$
L(\theta)=\frac{1}{M}\sum_{j=1}^{M}\left[y_{j}-Q(\mathbf{s}_{j},a_{j};\theta)\right]^{2},
$$

where $M$ is the size of the minibatch and $y_{j}$ is the target value, which can be evaluated by:

$$
y_{j}=\begin{cases}
r_{j} & \text{if episode terminates at step } j+1\\
r_{j}+\gamma \max_{a^{\prime}} Q(\mathbf{s}_{j+1},a^{\prime};\theta^{-}) & \text{otherwise}
\end{cases}.
$$
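As an illustration of how these targets and the loss are computed in practice, the following PyTorch fragment evaluates $y_{j}$ and the minibatch loss from sampled transitions; the tensor names, shapes, and the function signature are assumptions made for this sketch.

```python
# A sketch of the DQN minibatch loss; tensor layout is assumed for illustration:
# states (M, d), actions (M,), rewards (M,), next_states (M, d), dones (M,) in {0., 1.}.
import torch

def dqn_loss(q_net, target_net, states, actions, rewards, next_states, dones, gamma=1.0):
    # Q(s_j, a_j; theta) for the actions that were actually taken
    q_sa = q_net(states).gather(1, actions.long().unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        # y_j = r_j if the episode terminates, r_j + gamma * max_a' Q(s_{j+1}, a'; theta^-) otherwise
        max_next = target_net(next_states).max(dim=1).values
        y = rewards + gamma * (1.0 - dones) * max_next
    return torch.mean((y - q_sa) ** 2)
```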
Dueling DQN is a popular variant of DQN where a dueling architecture is introduced to enhance the performance of DQN [44]. Dueling DQN splits the Q-network into two streams: one for estimating the state value function $v(\mathbf{s}_{t};\theta)$ and another for estimating the advantage function $A(\mathbf{s}_{t},a_{t};\theta)$ parameterized by $\theta$. The state value function represents the expected cumulative reward obtainable from a given state $\mathbf{s}_{t}$, while the advantage function indicates the relative advantage of choosing action $a_{t}$ over other possible actions in that state. The core innovation of Dueling DQN lies in its independent estimation of the state value function and advantage function, and they are then recombined at the final stage of the network to evaluate the action-value function $Q(\mathbf{s}_{t},a_{t};\theta)$ for each action, that is,

$$
Q(\mathbf{s}_{t},a_{t};\theta)=v(\mathbf{s}_{t};\theta)+A(\mathbf{s}_{t},a_{t};\theta)-\frac{1}{|\mathscr{A}|}\sum_{a^{\prime}}A(\mathbf{s}_{t},a^{\prime};\theta),
$$
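A minimal PyTorch sketch of the dueling architecture implied by this equation is given below. The two hidden layers with 16 neurons mirror the setting used in the case study of Section 3.3, while the class name and the remaining details are illustrative assumptions.

```python
# A sketch of a dueling Q-network: shared trunk, separate value and advantage
# streams, recombined as in the equation above.
import torch.nn as nn

class DuelingQNet(nn.Module):
    def __init__(self, state_dim, n_actions, hidden=16):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU(),
                                   nn.Linear(hidden, hidden), nn.ReLU())
        self.value = nn.Linear(hidden, 1)              # v(s; theta)
        self.advantage = nn.Linear(hidden, n_actions)  # A(s, a; theta)

    def forward(self, s):
        h = self.trunk(s)
        v, a = self.value(h), self.advantage(h)
        return v + a - a.mean(dim=1, keepdim=True)     # Q(s, a; theta)
```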
Additionally, a soft updating strategy is utilized to update the parameters of the target network by blending the target network weights towards the Q-network weights:

$$
\theta^{-} \leftarrow \tau\theta+(1-\tau)\theta^{-},
$$

where $\tau$ is used to control the rate at which the target network is updated.
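In code, this soft update is a single in-place blend of the two parameter sets, as in the short sketch below (the function name and the default value of $\tau$ are assumptions).

```python
# Soft update of the target network: theta^- <- tau * theta + (1 - tau) * theta^-
import torch

@torch.no_grad()
def soft_update(target_net, q_net, tau=0.01):
    for p_target, p in zip(target_net.parameters(), q_net.parameters()):
        p_target.mul_(1.0 - tau).add_(tau * p)
```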
Algorithm 3: Dueling DQN
    1: Initialize memory buffer \(D\)
    2: Initialize action-value network \(Q\left(\mathbf{s}_{t}, a_{t} ; \theta\right)\) with two streams for \(v\left(\mathbf{s}_{t} ; \theta\right)\) and \(A\left(\mathbf{s}_{t}, a_{t} ; \theta\right)\)
    3: Initialize target action-value network \(Q\left(\mathbf{s}_{t}, a_{t} ; \theta^{-}\right)\)where \(\theta^{-}=\theta\)
    4: For episode \(\leftarrow 1\) to \(M\) do
    5: Initialize state \(\mathbf{s}_{t}\)
    6: \(\quad\) For \(t \leftarrow 1\) to \(T\) do
    7: \(\quad\) Select action \(a_{t}\) using \(\varepsilon\)-greedy policy based on \(Q\left(\mathbf{s}_{t}, a_{t} ; \theta\right)\)
    9: Execute action \(a_{t}\) and observe reward \(r_{t}\) and \(\mathbf{s}_{t+1}\)
10: Store transition \(\left(\mathbf{s}_{t}, a_{t}, r_{t}, \mathbf{s}_{t+1}\right)\) in replay memory \(D\)
11: Sample random minibatch of transitions \(\left(\mathbf{s}_{j}, a_{j}, r_{j}, \mathbf{s}_{j+1}\right)\) form \(D\)
12: \(\quad\) Evaluate \(y_{j}= \begin{cases}r_{j} & \text { if episode terminates at step } j+1 \\ r_{j}+\gamma \max _{a^{\prime}} Q\left(\mathbf{s}_{j+1}, a^{\prime} ; \theta^{-}\right) & \text {otherwise }\end{cases}\)
13: \(\quad\) Perform a gradient descent on \(\left[y_{j}-\left(v\left(\mathbf{s}_{j} ; \theta\right)+A\left(\mathbf{s}_{j}, a_{j} ; \theta\right)-\sum_{a^{\prime}} A\left(\mathbf{s}_{j}, a^{\prime} ; \theta\right) /|A|\right)\right]^{2}\) with
                respect to the network parameters \(\theta\)
14: Update the target network parameters \(\theta^{-} \leftarrow \tau \theta+(1-\tau) \theta^{-}\)
15: End For
16: End For
Fig. 7. The pseudo-code of the Dueling DQN algorithm.

The pseudo-code of the Dueling DQN algorithm is given in Fig. 7. Lines 1-3 are dedicated to initializing the neural network and memory buffer. Lines 4-16 illustrate the training processes of the neural network, informed by interactions with the environment. The code for implementing the Dueling DQN algorithm can be found in Figs. A.3 and A.4 in Appendix A of the supplementary materials.

3.3. Case study

A gas pipeline system transmission mission abort example is given to exemplify the application of RL algorithms in reliability optimization problems. The pipeline system possesses six possible discrete states, namely, no damage (perfectly functioning state), rupture (completely failed state), and four flaw states with different severity levels. The transmission capacities of the pipeline system in these states are $0, 1, 2, 3, 4$, and $5 \times 10^{5}\ \mathrm{m}^{3}$/day, respectively. The mission of the pipeline system is to transmit $D=2.5\times 10^{6}\ \mathrm{m}^{3}$ of gas within one week. The operation cost of the pipeline system is $c_{\mathrm{op}}=2\times 10^{4}$ US dollars/day. The successful completion of the transmission mission will lead to a profit of $r_{\mathrm{m}}=6\times 10^{5}$ US dollars, otherwise a loss of $c_{\mathrm{m}}=4\times 10^{5}$ US dollars is incurred. If the pipeline fails during the mission, in addition to the mission failure loss $c_{\mathrm{m}}$, a loss of $c_{\mathrm{f}}=1\times 10^{6}$ US dollars related to gas leakage will be incurred. The degradation profile of the pipeline system is characterized by a homogeneous discrete-state Markov chain, and the one-day transition probability matrix is:

$$
\mathbf{P}=\left[\begin{array}{cccccc}
1 & 0 & 0 & 0 & 0 & 0\\
0.2 & 0.8 & 0 & 0 & 0 & 0\\
0.1 & 0.1 & 0.8 & 0 & 0 & 0\\
0.02 & 0.05 & 0.08 & 0.85 & 0 & 0\\
0 & 0.02 & 0.05 & 0.08 & 0.85 & 0\\
0 & 0 & 0.02 & 0.03 & 0.05 & 0.9
\end{array}\right].
$$
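For readers who wish to reproduce the case study, a stripped-down simulation environment is sketched below. Monetary values are expressed in units of $10^{5}$ US dollars and gas volumes in units of $10^{5}\ \mathrm{m}^{3}$; the class interface, the within-day ordering of degradation, accumulation, and termination, and the reward bookkeeping are simplifying assumptions, and the authoritative version is the environment code given in Fig. A.6 of the supplementary materials.

```python
# A stripped-down simulation environment for the pipeline mission abort problem.
# Money is in 1e5 US dollars, gas in 1e5 m^3; the state is (X_t, G_t, delta_t).
# This is a simplified sketch of the environment given in Fig. A.6.
import numpy as np

P = np.array([[1.00, 0.00, 0.00, 0.00, 0.00, 0.00],
              [0.20, 0.80, 0.00, 0.00, 0.00, 0.00],
              [0.10, 0.10, 0.80, 0.00, 0.00, 0.00],
              [0.02, 0.05, 0.08, 0.85, 0.00, 0.00],
              [0.00, 0.02, 0.05, 0.08, 0.85, 0.00],
              [0.00, 0.00, 0.02, 0.03, 0.05, 0.90]])
CAPACITY = np.arange(6)               # transmission capacity of states 0..5
D, T = 25.0, 7                        # mission demand (1e5 m^3) and duration (days)
C_OP, R_M, C_M, C_F = 0.2, 6.0, 4.0, 10.0

class PipelineAbortEnv:
    def reset(self, seed=None):
        self.rng = np.random.default_rng(seed)
        self.x, self.g, self.t = 5, 0.0, 0        # perfect state, nothing transmitted
        return (self.x, self.g, T - self.t)

    def step(self, action):                       # 0: abort, 1: continue
        if action == 0:
            return (self.x, self.g, T - self.t), -C_M, True
        self.x = int(self.rng.choice(6, p=P[self.x]))   # one day of degradation
        self.g += CAPACITY[self.x]                      # gas transmitted that day
        self.t += 1
        reward, done = -C_OP, False
        if self.x == 0:                           # rupture: mission failure + leakage loss
            reward -= C_M + C_F
            done = True
        elif self.g >= D:                         # demand satisfied: mission success
            reward += R_M
            done = True
        elif self.t == T:                         # out of time without meeting demand
            reward -= C_M
            done = True
        return (self.x, self.g, T - self.t), reward, done
```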
The value iteration, Q-learning, and Dueling DQN algorithms are utilized to solve the dynamic mission abort problem of the gas pipeline system. The parameter settings are as follows: 1) value iteration: the convergence threshold is 0.001; 2) Q-learning: the number of iterations is 10,000, and the parameters of the $\varepsilon$-greedy strategy are $\varepsilon_{\min}=0.01$, $\varepsilon_{\max}=1.0$, and $\Delta\varepsilon=0.005$; 3) Dueling DQN: the Q-network consists of two hidden layers with 16 neurons, the number of iterations is 1000, the sizes of the memory and minibatch are 2000 and 32, respectively, the soft update rate is 0.01, the learning rate of the Q-network is 0.01, and the parameters of the $\varepsilon$-greedy strategy are $\varepsilon_{\min}=0.01$, $\varepsilon_{\max}=1.0$, and $\Delta\varepsilon=0.005$. The code for the algorithm's training process can be found in Fig. A.5, and the code for the simulation environment of the gas pipeline system mission abort problem is given in Fig. A.6 in Appendix A of the supplementary materials.
The comparative results, in terms of expected total profit and computational time for these algorithms, are presented in Table 1. As illustrated in Table 1, the three algorithms demonstrate comparable effectiveness in solving the gas pipeline system transmission mission abort problem, while their computational efficiency differs significantly. Due to the small size of both the state space and the action space in this case, the value iteration algorithm can solve for the optimal policy in a very short time based on the Bellman optimality equation and the perfect model of the environment.
Table 1
Comparative results of different algorithms.
| Algorithm | $v(\mathbf{s}_{0})$ ($\times 10^{5}$ US dollars) | CPU time (s) |
| :--- | :--- | :--- |
| Value iteration | 3.598 | 0.08 |
| Q-learning | 3.560 | 10.02 |
| Dueling DQN | 3.563 | 6.73 |
Conversely, the Q-learning and Dueling DQN algorithms have to learn the optimal policy through extensive interaction with the environment, necessitating greater computational time compared to the value iteration algorithm. Even though the computational time of the Q-learning and Dueling DQN algorithms is influenced by the parameter settings, their extensive environment interaction invariably leads to longer computational times than the value iteration algorithm when the problem size is relatively small. Furthermore, the superior function approximation capability of neural networks, coupled with the experience replay mechanism, enables the Dueling DQN algorithm to achieve higher learning efficiency compared to the Q-learning algorithm. The optimal mission abort policy is depicted in Fig. 8. Each subfigure in Fig. 8, i.e., (a), (b), (c), and (d), illustrates the optimal action for different system states and times in the mission under a fixed system cumulative performance $G_{t}$.
As illustrated in Fig. 8, the optimal policy exhibits a monotonic trend with the system state, the system cumulative performance, and the time in the mission. The optimal policy is bifurcated by a threshold into two parts, corresponding to the continue-mission action and the mission-abort action, respectively. When the system's cumulative performance and the mission time are fixed, aborting the mission becomes the optimal action if the system state falls below a certain threshold. For a fixed system state and system cumulative performance, continuing the mission is deemed optimal if the time in the mission is lower than a threshold. For a fixed system state and time in the mission, the optimal policy is a nondecreasing function of the system cumulative performance, indicating that a greater cumulative performance suggests accelerated mission progress.

Fig. 8. The optimal mission abort policy.

4. Reinforcement learning in maintenance optimization

Maintenance is an essential and critical activity in the operation phase of engineered systems that can restore system performance and avert system failures [45]. Maintenance optimization aims to identify the optimal maintenance policy that maximizes the system reliability/availability or minimizes the maintenance cost, and it plays an important role in enhancing the operational effectiveness of systems [46,47]. MDP provides a suitable modeling framework for maintenance optimization, and many RL approaches have been developed to resolve maintenance optimization problems. The literature on maintenance optimization can be broadly categorized into two streams: maintenance optimization for single-component systems and maintenance optimization for multi-component systems. Within the maintenance optimization for single-component systems, the degradation of systems is generally formulated as a discrete-state degradation model or a continuous-state degradation model, and maintenance actions are initiated once the system's state reaches a scheduled level. Standard dynamic programming algorithms, such as value iteration and policy iteration, are extensively adopted for solving maintenance optimization of single-component systems, and the structural properties of the maintenance policies are typically leveraged to facilitate the identification of optimal policies. Maintenance optimization for multi-component systems poses additional challenges due to four sorts of dependencies among components [45], namely, economic, structural, stochastic, and resource dependencies. The optimization of maintenance policies for multi-component systems necessitates a holistic approach, taking into account the states of all components and their dependencies. Maintenance optimization for multi-component systems often suffers from the curse of dimensionality, as the state and action spaces for engineering instances, i.e., the possible state combinations and action combinations of all components, are typically quite large. Designing advanced algorithms to estimate the optimal maintenance policies in a computationally efficient manner represents a significant challenge in maintenance optimization for multi-component systems. Since the pioneering contributions by Liu et al. [7,11,48], which showcased the remarkable performance of DRL in addressing dynamic maintenance problems for multi-component multi-state systems, RL and DRL algorithms have been extensively incorporated into the maintainability community [49,50]. Fig. 9 provides a brief review of the literature on reinforcement learning in maintenance optimization.
In this section, we will utilize a condition-based maintenance scenario for multi-state multi-component systems as an example to demonstrate the application of RL within the context of maintenance optimization.

4.1. Markov decision process formulation

To develop an MDP model for maintenance optimization, one begins by identifying the states of the system related to maintenance demand. These states should reflect various levels of system/component degradation, operational efficiency, and failure conditions, thereby representing the system's health status and maintenance demand. Secondly, identify the possible maintenance actions that can be performed to restore the states of the system/component, such as minimal repair, imperfect maintenance, and replacement, each with associated costs and expected impacts on system performance. Thirdly, determine the transition probabilities given the maintenance actions. These probabilities can be derived from maintenance records and expert assessments. Accurate transition probabilities are crucial for predicting the outcomes of maintenance policies. Lastly, develop a reward function that evaluates the effectiveness of maintenance actions. This function should balance the benefits of improved system performance against the costs of maintenance activities, considering factors such as system performance, system downtime, and maintenance expenses. To sum up, relevant data, including system performance data, historical failure data, operational data, maintenance logs, and cost information, should be collected to support the MDP model development.
Suppose a repairable multistate system consisting of $N$ components. Component $k$ possesses $n_{k}$ possible states $\{1,2,\cdots,n_{k}\}$, where state 1 and state $n_{k}$ are the completely failed state and the perfectly functioning state, respectively. The state of component $k$ at time step $t$ is represented as $X_{k,t}$. The performance capacity of component $k$ in state $i$ is denoted as $g_{k,i}$, and one has $g_{k,i}<g_{k,j}, \forall i<j$. The performance capacity of component $k$ at time $t$ is denoted as $G_{k,t}\in\{g_{k,1},g_{k,2},\cdots,g_{k,n_{k}}\}$. The degradation of each component follows a homogeneous discrete-time Markov chain. Denote $p_{i,j}^{k}$ as the transition probability of component $k$ from state $i$ to state $j$; it satisfies $p_{i,j}^{k}=0$ for $j>i$ and $\sum_{j=1}^{i}p_{i,j}^{k}=1$. The performance capacity of the whole system, denoted as $G_{t}$, is determined by the performance capacities of all components and the structure function of the system $\phi(\cdot)$, i.e., $G_{t}=\phi(G_{1,t},G_{2,t},\cdots,G_{N,t})$. For example, the performance capacity of a two-component flow transmission system can be represented by $G_{t}=G_{1,t}+G_{2,t}$ if the components are connected in parallel or $G_{t}=\min\{G_{1,t},G_{2,t}\}$ if they are connected in series. It is noteworthy that, even though this tutorial adopts Markov processes to model component degradation profiles in both the reliability optimization and maintenance optimization cases, the utilization of MDPs to solve reliability optimization and maintenance optimization problems does not rely on this assumption. For general models where the state transition times obey a non-exponential distribution, the state space needs to be well designed to ensure that the information required for decision-making is covered and that the model is memoryless. For example, for a binary-state system with an arbitrary lifetime distribution, the state space of its MDP needs to incorporate the current system operational state as well as the effective age to ensure memorylessness. Put another way, the failure probability of the system at any future moment can be evaluated from the current state of the MDP, i.e., the system operational state and the effective age.

Fig. 9. An overview of the existing literature on RL in maintenance optimization [7,11,14,51-75].
The planning horizon is T T TT decision steps, where the degradation states of components can be observed at each step. After each decision step, two possible actions can be selected for each component { 0 , 1 } { 0 , 1 } {0,1}\{0,1\} where action 0 corresponds to “do nothing” and action 1 stands for replacement of the component. After a replacement, a component will restore to state n k n k n_(k)n_{k}, i.e., the best state. If component k k kk is functioning when replacement at time step t t tt, i.e., X k , t > 1 X k , t > 1 X_(k,t) > 1X_{k, t}>1, a preventive replacement cost c k PM c k PM c_(k)^(PM)c_{k}^{\mathrm{PM}} is carried out, whereas a more expensive corrective replacement cost c k CM c k CM c_(k)^(CM)c_{k}^{\mathrm{CM}} has to be incurred if the component k k kk has completely failed. At the end of each time step, a profit related to system performance capacity will be earned, i.e., μ G t μ G t muG_(t)\mu G_{t}. The state space, action space, reward function, and Bellman equation of the condition-based problem are formulated as follows.
规划时间范围为 T T TT 个决策步骤,在每个步骤中可以观察到组件的退化状态。在每个决策步骤后,可以为每个组件选择两种可能的行动 { 0 , 1 } { 0 , 1 } {0,1}\{0,1\} ,其中行动 0 对应于“无所作为”,行动 1 则表示更换组件。在更换后,组件将恢复到状态 n k n k n_(k)n_{k} ,即最佳状态。如果组件 k k kk 在时间步骤 t t tt 时仍在运行,即 X k , t > 1 X k , t > 1 X_(k,t) > 1X_{k, t}>1 ,则会进行预防性更换成本 c k PM c k PM c_(k)^(PM)c_{k}^{\mathrm{PM}} ,而如果组件 k k kk 完全失效,则必须承担更昂贵的纠正性更换成本 c k CM c k CM c_(k)^(CM)c_{k}^{\mathrm{CM}} 。在每个时间步骤结束时,将获得与系统性能能力相关的利润,即 μ G t μ G t muG_(t)\mu G_{t} 。条件基础问题的状态空间、行动空间、奖励函数和贝尔曼方程如下所示。
State space: In the CBM problems, the maintenance decisions are made according to the degraded states of all the components. Since the optimal policy is defined over a finite planning horizon, the state space has to contain, in addition to the degraded states of the components, the remaining time steps in mission, i.e., s t = { ( X 1 , t , , X k , t , , X N , t , τ t ) 1 s t = X 1 , t , , X k , t , , X N , t , τ t 1 s_(t)={(X_(1,t),cdots,X_(k,t),cdots,X_(N,t),tau_(t))∣1 <= :}\mathbf{s}_{t}=\left\{\left(X_{1, t}, \cdots, X_{k, t}, \cdots, X_{N, t}, \tau_{t}\right) \mid 1 \leq\right. X k , t n k , 0 τ t T } X k , t n k , 0 τ t T {:X_(k,t) <= n_(k),0 <= tau_(t) <= T}\left.X_{k, t} \leq n_{k}, 0 \leq \tau_{t} \leq T\right\}
状态空间:在 CBM 问题中,维护决策是根据所有组件的退化状态做出的。由于最优策略是在有限的规划时间范围内定义的,因此状态空间除了包含组件的退化状态外,还必须包含任务中剩余的时间步,即 s t = { ( X 1 , t , , X k , t , , X N , t , τ t ) 1 s t = X 1 , t , , X k , t , , X N , t , τ t 1 s_(t)={(X_(1,t),cdots,X_(k,t),cdots,X_(N,t),tau_(t))∣1 <= :}\mathbf{s}_{t}=\left\{\left(X_{1, t}, \cdots, X_{k, t}, \cdots, X_{N, t}, \tau_{t}\right) \mid 1 \leq\right. X k , t n k , 0 τ t T } X k , t n k , 0 τ t T {:X_(k,t) <= n_(k),0 <= tau_(t) <= T}\left.X_{k, t} \leq n_{k}, 0 \leq \tau_{t} \leq T\right\}
Action space: The action of the MDP is denoted as the maintenance actions of all the components, i.e., a t = { a 1 , t , , a k , t , , a N , t } a t = a 1 , t , , a k , t , , a N , t a_(t)={a_(1,t),cdots,a_(k,t),cdots,a_(N,t)}\boldsymbol{a}_{t}=\left\{a_{1, t}, \cdots, a_{k, t}, \cdots, a_{N, t}\right\}, where a k , t a k , t a_(k,t)a_{k, t} = 1 = 1 =1=1 represents that degraded component k k kk is selected to be maintained at time step t t tt and a k , t = 0 a k , t = 0 a_(k,t)=0a_{k, t}=0 represents that “do nothing” action is selected.
行动空间:MDP 的行动表示为所有组件的维护行动,即 a t = { a 1 , t , , a k , t , , a N , t } a t = a 1 , t , , a k , t , , a N , t a_(t)={a_(1,t),cdots,a_(k,t),cdots,a_(N,t)}\boldsymbol{a}_{t}=\left\{a_{1, t}, \cdots, a_{k, t}, \cdots, a_{N, t}\right\} ,其中 a k , t a k , t a_(k,t)a_{k, t} = 1 = 1 =1=1 表示在时间步 t t tt 选择对退化组件 k k kk 进行维护,而 a k , t = 0 a k , t = 0 a_(k,t)=0a_{k, t}=0 表示选择“无所作为”行动。

The transition probability of system after executing an action, denoted as Pr { s t + 1 s t , a t } Pr s t + 1 s t , a t Pr{s_(t+1)∣s_(t),a_(t)}\operatorname{Pr}\left\{\mathbf{s}_{t+1} \mid \mathbf{s}_{t}, \boldsymbol{a}_{t}\right\}, can be calculated by the state transition of components, and it is formulated as:
系统在执行一个动作后的转移概率,记作 Pr { s t + 1 s t , a t } Pr s t + 1 s t , a t Pr{s_(t+1)∣s_(t),a_(t)}\operatorname{Pr}\left\{\mathbf{s}_{t+1} \mid \mathbf{s}_{t}, \boldsymbol{a}_{t}\right\} ,可以通过组件的状态转移来计算,其公式为:

$$
\operatorname{Pr}\{\mathbf{s}_{t+1}\mid\mathbf{s}_{t},\boldsymbol{a}_{t}\}=\prod_{k=1}^{N}\operatorname{Pr}\{X_{k,t+1}\mid X_{k,t},a_{k,t}\}\cdot I(\tau_{t+1}=\tau_{t}-1),
$$
where $I(\cdot)$ is the indicator function and $I(\tau_{t+1}=\tau_{t}-1)$ indicates the deterministic transition of the remaining time steps in the mission. This means that after performing an action, the remaining time steps will be reduced by one. $\operatorname{Pr}\{X_{k,t+1}\mid X_{k,t},a_{k,t}\}$ is the transition probability of component $k$ from state $X_{k,t}$ to $X_{k,t+1}$ after executing maintenance action $a_{k,t}$, and it can be evaluated by:

$$
\operatorname{Pr}\{X_{k,t+1}\mid X_{k,t},a_{k,t}\}=\begin{cases}
p_{X_{k,t},X_{k,t+1}}^{k} & a_{k,t}=0\\
I(X_{k,t+1}=n_{k}) & a_{k,t}=1
\end{cases},
$$
where $I(X_{k,t+1}=n_{k})$ indicates that the state of component $k$ will be restored to the best state, i.e., the component will transit to the best state with a probability of one.
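The two equations above translate directly into a simulation step; the sketch below samples the next state of every component given the selected maintenance actions. The list of per-component transition matrices `P_list` and the zero-based state indexing (index 0 for the failed state, index $n_k-1$ for the perfect state) are assumed conventions for illustration.

```python
# Sampling the joint next state under maintenance actions, following
# Pr{s_{t+1} | s_t, a_t} = prod_k Pr{X_{k,t+1} | X_{k,t}, a_{k,t}}.
# P_list[k] is the (n_k x n_k) degradation matrix of component k, with states
# stored as indices 0..n_k-1 (0 = failed, n_k-1 = perfect).
import numpy as np

def step_components(states, actions, P_list, rng=np.random.default_rng()):
    next_states = []
    for x, a, P_k in zip(states, actions, P_list):
        if a == 1:                                       # replacement: back to the best state
            next_states.append(P_k.shape[0] - 1)
        else:                                            # "do nothing": natural degradation
            next_states.append(int(rng.choice(P_k.shape[0], p=P_k[x])))
    return tuple(next_states)
```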
Reward: The maintenance cost is related to the current states of components. Considering the economic dependency among the components, the associated maintenance cost of the system can be written as:

$$
C_{t}=\begin{cases}
c_{\mathrm{s}}+\sum_{k=1}^{N}c_{k,t} & \text{if any component is maintained}\\
0 & \text{otherwise}
\end{cases},
$$
where $c_{\mathrm{s}}$ is the maintenance cost associated with transportation, maintenance setup, and maintenance personnel, and $c_{k,t}$ is the maintenance cost of component $k$ at time $t$, which can be calculated as:

$$
c_{k,t}=\begin{cases}
0 & a_{k,t}=0\\
c_{k}^{\mathrm{PM}} & a_{k,t}=1,\ X_{k,t}>1\\
c_{k}^{\mathrm{CM}} & a_{k,t}=1,\ X_{k,t}=1
\end{cases}.
$$
The reward of the MDP is defined as the profit, and it can be formulated as:

$$
r(\mathbf{s}_{t},\boldsymbol{a}_{t},\mathbf{s}_{t+1})=\mu G_{t}-C_{t}.
$$
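Putting the cost terms together, the per-step reward can be evaluated as in the following sketch. The argument names (`capacities`, `structure_fn`, the cost vectors) are assumptions mirroring the notation above, and whether $G_{t}$ is evaluated before or after the state transition follows the convention of the supplementary code; here it is computed from the current component states for simplicity.

```python
# Per-step reward of the CBM MDP: production profit minus maintenance cost.
# `capacities[k][x]` is the capacity of component k in state index x (0 = failed),
# `structure_fn` maps component capacities to the system capacity G_t, and
# c_pm / c_cm are per-component preventive / corrective replacement costs.
def cbm_reward(states, actions, capacities, structure_fn, c_setup, c_pm, c_cm, mu):
    cost = 0.0
    if any(actions):                                    # setup cost once any maintenance occurs
        cost += c_setup
        for k, (x, a) in enumerate(zip(states, actions)):
            if a == 1:
                cost += c_cm[k] if x == 0 else c_pm[k]  # corrective if failed, else preventive
    g_sys = structure_fn([capacities[k][x] for k, x in enumerate(states)])
    return mu * g_sys - cost
```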
Based on the above elements, the Bellman equation of the MDP can therefore be formulated as:

$$
v^{*}(\mathbf{s}_{t})=\max_{\boldsymbol{a}_{t}}\sum_{\mathbf{s}_{t+1}}\operatorname{Pr}\{\mathbf{s}_{t+1}\mid\mathbf{s}_{t},\boldsymbol{a}_{t}\}\left[r(\mathbf{s}_{t},\boldsymbol{a}_{t},\mathbf{s}_{t+1})+\gamma v^{*}(\mathbf{s}_{t+1})\right].
$$

4.2. Solution algorithms

The value iteration, Q-learning, and Dueling DQN algorithms in Section 3.2 can be readily modified to solve the maintenance optimization problem. In this section, another policy-based DRL algorithm, proximal policy optimization (PPO) [76], is introduced. In the PPO algorithm, an agent is defined by the actor-critic architecture. The actor estimates the optimal maintenance actions at each state, $\pi(\boldsymbol{a}_{t}\mid\mathbf{s}_{t};\theta)$, whereas the critic evaluates the optimal value function $v(\mathbf{s}_{t};\beta)$, which denotes the total expected profit. The actor and critic are represented by two neural networks parameterized by $\theta$ and $\beta$, respectively. The inputs of the actor and critic networks are identical and consist of the degraded states of the components $X_{k,t}, k=1,\cdots,N$ and the remaining time steps in the mission $\tau_{t}$. The output of the actor network is the probability of taking each action, whereas the output of the critic network is the value function of a specific state.
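A minimal PyTorch sketch of such an actor-critic agent is given below; the two hidden layers with 16 neurons match the case-study setting in Section 4.3, whereas treating the joint maintenance action as a single categorical output of size $2^N$ and the class names are illustrative assumptions.

```python
# A sketch of the PPO actor-critic networks: the actor outputs a probability for
# each joint maintenance action (2^N combinations), the critic outputs v(s; beta).
import torch.nn as nn

class Actor(nn.Module):
    def __init__(self, state_dim, n_actions, hidden=16):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, hidden), nn.ReLU(),
                                 nn.Linear(hidden, n_actions), nn.Softmax(dim=-1))

    def forward(self, s):
        return self.net(s)                  # pi(a | s; theta)

class Critic(nn.Module):
    def __init__(self, state_dim, hidden=16):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 1))

    def forward(self, s):
        return self.net(s).squeeze(-1)      # v(s; beta)
```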
In the PPO algorithm, the agent repeatedly interacts with the environment in each iteration to collect trajectories $(\mathbf{s}_{t},\boldsymbol{a}_{t},r_{t})$ for agent training. The parameters of the agent, i.e., the parameters of the actor and critic networks, are updated at the end of each iteration. Let $\theta_{\text{old}}$ denote the parameters of the actor network before updating; the actor network is then updated using the clipped surrogate objective as follows:

$$
L(\theta)=\widehat{\mathbb{E}}_{t}\left[\min\left\{r_{t}(\theta)\widehat{A}_{t},\ f_{\text{clip}}\left(r_{t}(\theta),1-\delta,1+\delta\right)\widehat{A}_{t}\right\}\right],
$$
where $r_{t}(\theta)$ represents the probability ratio of the updated policy and the old policy, $\widehat{A}_{t}$ is the estimator of the advantage function at decision period $t$, and $f_{\text{clip}}(\cdot)$ is the clip function; they are formulated as follows:

$$
r_{t}(\theta)=\frac{\pi(\boldsymbol{a}_{t}\mid\mathbf{s}_{t};\theta)}{\pi(\boldsymbol{a}_{t}\mid\mathbf{s}_{t};\theta_{\text{old}})},
$$

$$
\widehat{A}_{t}=\sum_{t^{\prime}>t}r_{t^{\prime}}-v(\mathbf{s}_{t};\beta),
$$

$$
f_{\text{clip}}(r_{t}(\theta),1-\delta,1+\delta)=\begin{cases}
1-\delta & r_{t}(\theta)<1-\delta\\
r_{t}(\theta) & 1-\delta\leq r_{t}(\theta)\leq 1+\delta\\
1+\delta & r_{t}(\theta)>1+\delta
\end{cases},
$$
where $\delta$ is the clipping ratio, which is used to limit the maximum allowed change in the policy at each update.
The parameters of the critic network are updated by:

$$
L(\beta)=\widehat{\mathbb{E}}_{t}\left[\left(\sum_{t^{\prime}>t}r_{t^{\prime}}-v(\mathbf{s}_{t};\beta)\right)^{2}\right].
$$
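The clipped surrogate objective and the critic loss above can be assembled as in the sketch below. The minibatch tensor layout is an assumption, and, following the standard PPO convention, gradient descent is performed on the negative of the surrogate (i.e., the surrogate is maximized); the complete implementation is given in Figs. B.1-B.3 of the supplementary materials.

```python
# Assembling the PPO losses for one minibatch of collected transitions.
# states: (B, d); actions: (B,) joint-action indices; old_log_probs: (B,) from the
# behavior policy; returns: (B,) reward-to-go sums.
import torch

def ppo_losses(actor, critic, states, actions, old_log_probs, returns, clip=0.1):
    values = critic(states)
    advantages = returns - values.detach()                       # A_t = sum_{t'>t} r_t' - v(s_t)
    probs = actor(states).gather(1, actions.long().unsqueeze(1)).squeeze(1)
    ratio = torch.exp(torch.log(probs) - old_log_probs)          # r_t(theta)
    clipped = torch.clamp(ratio, 1.0 - clip, 1.0 + clip)
    actor_loss = -torch.mean(torch.min(ratio * advantages, clipped * advantages))
    critic_loss = torch.mean((returns - values) ** 2)            # L(beta)
    return actor_loss, critic_loss
```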
Algorithm 4: PPO
    1: Initialize actor network \(\pi\left(\boldsymbol{a}_{t} \mid \mathbf{s}_{t} ; \theta\right)\)
    2: Initialize critic network \(v\left(\mathbf{s}_{t} ; \beta\right)\)
    3: For episode \(\leftarrow 1\) to \(M\) do
    4: \(\quad\) Collect set of trajectories \(\left(\mathbf{s}_{t}, a_{t}, r_{t}\right)\) by implementing policy \(\pi\left(\boldsymbol{a}_{t} \mid \mathbf{s}_{t} ; \theta\right)\)
    5: Estimate advantages \(\hat{A}_{t}=\sum_{t^{\prime}>t} r_{t^{\prime}}-v\left(\mathbf{s}_{t} ; \beta\right)\)
    6: \(\quad \theta_{\text {old }} \leftarrow \theta\)
    7: \(\quad\) For \(t \leftarrow 1\) to \(T\) do
    8: \(\quad\) Evaluate probability ratio \(r_{t}(\theta)=\pi\left(\boldsymbol{a}_{t} \mid \mathbf{s}_{t} ; \theta\right) / \pi\left(\boldsymbol{a}_{t} \mid \mathbf{s}_{t} ; \theta_{\text {old }}\right)\)
    9: \(\quad\) Evaluate clipped surrogate objective \(L(\theta)=\hat{\mathrm{E}}_{t}\left[\min \left\{r_{t}(\theta) A_{t}, f_{\text {clip }}\left(r_{t}(\theta), 1-\delta, 1+\delta\right)\right\} \hat{A}_{t}\right]\)
10: Perform a gradient descent on \(L(\theta)\) with respect to the network parameters \(\theta\)
11: End For
12: \(\quad\) Evaluate loss function of the critic network \(L(\beta)=\hat{\mathrm{E}}_{t}\left[\left(\sum_{t^{\prime}>t} r_{t^{\prime}}-v\left(\mathbf{s}_{t} ; \beta\right)\right)^{2}\right]\)
13: Perform a gradient descent on \(L(\beta)\) with respect to the network parameters \(\beta\)
14: End For
Fig. 10. The pseudo-code of the PPO algorithm.
Fig. 11. Configuration of the manufacturing system.
Table 2
The state transition probability of each machine.
| No. | $p_{2,2}^{k}$ | $p_{2,1}^{k}$ | $p_{3,3}^{k}$ | $p_{3,2}^{k}$ | $p_{3,1}^{k}$ | $p_{4,4}^{k}$ | $p_{4,3}^{k}$ | $p_{4,2}^{k}$ | $p_{4,1}^{k}$ |
| :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- |
| 1 | 0.80 | 0.20 | 0.85 | 0.10 | 0.05 | 0.90 | 0.05 | 0.03 | 0.02 |
| 2 | 0.75 | 0.25 | 0.80 | 0.10 | 0.10 | 0.90 | 0.05 | 0.03 | 0.02 |
| 3 | 0.85 | 0.15 | 0.85 | 0.10 | 0.05 | 0.90 | 0.06 | 0.04 | 0 |
| 4 | 0.75 | 0.25 | 0.78 | 0.12 | 0.10 | 0.85 | 0.05 | 0.05 | 0.05 |
| 5 | 0.85 | 0.15 | 0.90 | 0.05 | 0.05 | 0.92 | 0.05 | 0.03 | 0 |
The pseudo-code of the PPO algorithm is shown in Fig. 10. Lines 1-2 initialize the actor and critic networks. Lines 3-14 illustrate the training processes for the actor and critic networks based on interactions with the environment. The code for implementing the PPO algorithm can be found in Figs. B.1-B.3 in Appendix B of the supplementary materials.

4.3. Case study

A manufacturing system example is used to illustrate the RL algorithms. The manufacturing system consists of two processing stations and an assembly station, as shown in Fig. 11. Each processing station contains two machines to process the raw workpieces, and the work-in-process parts are finally assembled into products at the assembly station. For each machine, the production capacity degrades with operation. Each machine in the processing stations possesses four states with production capacities of 0, 2, 3, and 4 batches per day, and the assembly station possesses four states with production capacities of 0, 4, 6, and 8 batches per day. The degradation profile of each machine is characterized by a homogeneous discrete-state Markov chain, and the transition probabilities are given in Table 2. The production capacity of the entire manufacturing system depends on the system structure function and the states of the machines, and it is formulated as:

$$
G_{t}=\min\{G_{1,t}+G_{2,t},\ G_{3,t}+G_{4,t},\ G_{5,t}\}.
$$
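In code, this structure function is a one-line reduction over the machine capacities, as in the brief sketch below (the function name is illustrative).

```python
# System capacity of the manufacturing system in Fig. 11: two parallel machines per
# processing station, the two stations in series with the assembly station.
def system_capacity(g):                  # g = [G1, G2, G3, G4, G5]
    return min(g[0] + g[1], g[2] + g[3], g[4])

# e.g., machines 1-4 at 3 batches/day and the assembly station at 8 batches/day:
assert system_capacity([3, 3, 3, 3, 8]) == 6
```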
The maintenance setup cost $c_{\mathrm{s}}$ is $3\times 10^{3}$ US dollars. The corrective maintenance and preventive maintenance costs of each machine are given in Table 3. The profit of producing one batch of products is $1\times 10^{3}$ US dollars. The goal is to maximize the total production profit while minimizing the maintenance cost over the next two months.
The value iteration, Q-learning, and PPO algorithms are utilized to solve the CBM problem of the manufacturing system.
Table 3
Parameter settings of machines (unit: $\times 10^{3}$ US dollars).

| No. | $c_{k}^{\mathrm{CM}}$ | $c_{k}^{\mathrm{PM}}$ |
| :--- | :--- | :--- |
| 1 | 5 | 3 |
| 2 | 5 | 2.5 |
| 3 | 6 | 3.5 |
| 4 | 4 | 2 |
| 5 | 9 | 5 |
Table 4
Comparative results of different algorithms.
| Algorithm | $v(\mathbf{s}_{0})$ ($\times 10^{5}$ US dollars) | CPU time (s) |
| :--- | :--- | :--- |
| Value iteration | 2.04 | 6053 |
| Q-learning | 1.59 | 1013 |
| PPO | 1.95 | 418 |
settings of these algorithms are as follow: 1) value iteration: the threshold of is 0.001 ; 2 ) 0.001 ; 2 ) 0.001;2)0.001 ; 2) Q-learning: the number of iterations is 100,000 , the parameters of the ε ε epsi\varepsilon-greedy strategy are ε min = 0.01 , ε max = 1.0 ε min = 0.01 , ε max = 1.0 epsi_(min)=0.01,epsi_(max)=1.0\varepsilon_{\min }=0.01, \varepsilon_{\max }=1.0, and Δ ε = 0.005 Δ ε = 0.005 Delta epsi=0.005\Delta \varepsilon=0.005; 3) PPO: the actor network and critic network are both consist of two hidden layers with 16 neurons, the number of iterations is 20,000, the learning rates of actor network and critic network are 0.001 and 0.003 , respectively, the clipping ratio is set to 0.1 , and the iterations at each update of networks is K = 80 K = 80 K=80K=80. The code for the algorithm’s training process can be found in Fig. B.4, and the code for the simulation environment of the manufacturing system CBM problem is given in Fig. B. 5 in Appendix B of the supplementary materials.
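For readers who wish to reproduce the tabular baseline, the fragment below sketches an $\varepsilon$-greedy Q-learning loop using the exploration schedule listed above ($\varepsilon$ decayed from 1.0 towards 0.01 by 0.005 per episode). The `env` object with `reset()`, `step()`, and `actions()` methods stands in for the simulation environment of Fig. B.5, and the learning rate and discount factor are placeholders, as their values are not specified here.

```python
import random
from collections import defaultdict


def q_learning(env, n_episodes=100_000, alpha=0.1, gamma=0.99,
               eps_min=0.01, eps_max=1.0, eps_decay=0.005):
    """Tabular Q-learning with a decaying epsilon-greedy policy (sketch)."""
    Q = defaultdict(lambda: defaultdict(float))   # Q[state][action]; states must be hashable
    eps = eps_max
    for _ in range(n_episodes):
        state, done = env.reset(), False
        while not done:
            # epsilon-greedy action selection
            if random.random() < eps or not Q[state]:
                action = random.choice(env.actions(state))
            else:
                action = max(Q[state], key=Q[state].get)
            next_state, reward, done = env.step(action)
            # one-step temporal-difference update
            best_next = max(Q[next_state].values(), default=0.0)
            Q[state][action] += alpha * (reward + gamma * best_next - Q[state][action])
            state = next_state
        eps = max(eps_min, eps - eps_decay)       # decay the exploration rate
    return Q
```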
The comparative results in terms of expected total profit and computational time of these algorithms are shown in Table 4. Different from the mission abort problem in Section 3.3, the state and action spaces of the maintenance optimization problem for the four-machine manufacturing system are significantly larger, and the value iteration algorithm consumes a relatively long computational time. The Q-learning and PPO algorithms are more efficient than the value iteration algorithm since they learn the optimal policy through continuous interactions between the agent and the environment rather than iteratively estimating the values of all states. Moreover, the PPO algorithm leverages neural networks to approximate the policy function and the value function, which further improves solution effectiveness and efficiency in large-scale problems.
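Because Table 4 compares the algorithms by the expected total profit $v(\mathbf{s}_{0})$, the corresponding figure for a learned policy is typically estimated by averaging the episode return over many simulated runs. The helper below is a generic Monte Carlo evaluation sketch; the `env` and `policy` interfaces are assumptions, not part of the original implementation.

```python
def evaluate_policy(env, policy, n_episodes=1000):
    """Monte Carlo estimate of the expected total return from the initial state."""
    total = 0.0
    for _ in range(n_episodes):
        state, done, ep_return = env.reset(), False, 0.0
        while not done:
            state, reward, done = env.step(policy(state))
            ep_return += reward            # accumulate the episode return
        total += ep_return
    return total / n_episodes              # estimate of v(s0)
```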
To demonstrate the performance of the three algorithms on different problem scales, a sensitivity analysis is conducted on the number of components $N$. The dimensions of the state and action spaces of the CBM models and the computational times of the three algorithms for different settings of $N$ are tabulated in Tables 5 and 6, respectively. It can be seen that the state and action spaces increase exponentially with the number of components. The DP algorithm has high computational efficiency for small-scale instances, while its efficiency decreases significantly as the system scale increases. The computational time of the Q-learning and PPO algorithms is relatively insensitive to the problem scale, as they learn the optimal policy through interactions with the environment without
Table 5
Dimension of the state and action space for different settings of $N$.

| | $N=2$ | $N=3$ | $N=4$ | $N=5$ | $N=6$ |
| :--- | :--- | :--- | :--- | :--- | :--- |
| $\vert\mathscr{S}\vert$ | 800 | 3200 | 12,800 | 51,200 | 204,800 |
| $\vert\mathscr{A}\vert$ | 4 | 8 | 16 | 32 | 64 |
Table 6
Computational time of the three algorithms for different settings of $N$ (CPU time in seconds).

| Algorithm | $N=2$ | $N=3$ | $N=4$ | $N=5$ | $N=6$ |
| :--- | :--- | :--- | :--- | :--- | :--- |
| Value iteration | 5.90 | 54.50 | 542.72 | 6053 | 72,249 |
| Q-learning | 990 | 1006 | 1019 | 1025 | 1046 |
| PPO | 385 | 396 | 402 | 418 | 442 |
Table 7
Relative errors of the Q-learning and PPO algorithms for different settings of $N$.

| Algorithm | $N=2$ | $N=3$ | $N=4$ | $N=5$ | $N=6$ |
| :--- | :--- | :--- | :--- | :--- | :--- |
| Q-learning | 8.14% | 12.06% | 16.94% | 22.05% | 28.67% |
| PPO | 2.96% | 3.47% | 3.94% | 4.41% | 5.18% |
enumerating all states. As the DP algorithm can accurately compute the optimal results, we use the relative errors of the Q-learning and PPO algorithms with respect to DP as indicators of algorithm performance. The relative errors of the Q-learning and PPO algorithms for different settings of $N$ are shown in Table 7. It can be seen that the relative error of the PPO algorithm is always less than 6%, indicating its superior performance across different problem scales. In contrast, the performance of the Q-learning algorithm degrades significantly as the problem scale increases. From Tables 6 and 7, it is observed that the DP algorithm has a significant advantage in small-scale problems, but it relies on an accurate model of the environment. In contrast, DRL algorithms outperform the other algorithms in large-scale problems due to their higher computational efficiency and accuracy.
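The relative errors in Table 7 are measured against the value-iteration result. A minimal sketch of this indicator is given below; the absolute-difference form is our assumption, though it reproduces the reported numbers (e.g., PPO at $N=5$: $|1.95-2.04|/2.04 \approx 4.41\%$).

```python
def relative_error(v_rl, v_dp):
    """Relative gap between an RL estimate of v(s0) and the exact DP value."""
    return abs(v_rl - v_dp) / abs(v_dp)


# Example with the N = 5 values from Table 4: PPO vs. value iteration
print(f"{relative_error(1.95, 2.04):.2%}")  # -> 4.41%
```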

5. Research challenges and future directions

Even though many achievements have been made in RL for reliability and maintenance optimization, there are still several challenges that drive future research.

(1) Existing RL models for maintenance planning mainly focus on single-agent environments. In many practical scenarios, multiple agents need to collaborate to achieve a common maintenance goal. Multi-agent RL (MARL) algorithms can provide an efficient way to train optimal maintenance policies for each agent. However, the introduction of MARL to solve reliability and maintenance optimization problems will also introduce new challenges. The complexity of interactions between multiple agents leads to non-stationary environments, making the learning process unstable and unpredictable. Moreover, coordination among agents to achieve a common goal requires complex communication protocols and synchronization mechanisms.

(2) The inherent limitations of observation methods and measurement precision in engineering contexts often render system states partially observable, complicating effective decision-making. Under such a scenario, the partially observable Markov decision process (POMDP) has been introduced. However, the computational complexity of solving POMDPs further complicates their application, requiring extensive computational resources and sophisticated algorithms to approximate the optimal policy. To address this, future efforts could focus on methods for accurately estimating hidden states and making decisions in partially observable environments, such as belief state representation and approximate inference.
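As a small illustration of the belief-state representation mentioned above, the Bayesian filter below updates a belief vector over hidden degradation states from a transition matrix and an observation-likelihood matrix. It is a generic sketch and is not tied to any specific system in this tutorial.

```python
import numpy as np


def belief_update(belief, P, O, obs):
    """One-step belief update for a discrete-state POMDP.

    belief: current belief over hidden states, shape (n,)
    P:      transition matrix, P[i, j] = Pr(s' = j | s = i)
    O:      observation likelihoods, O[j, o] = Pr(obs = o | s' = j)
    obs:    index of the received observation
    """
    predicted = belief @ P               # prediction through the degradation dynamics
    updated = predicted * O[:, obs]      # correction by the observation likelihood
    return updated / updated.sum()       # normalize (assumes the observation has nonzero probability)
```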

(3) Traditional MDP models assume precise knowledge of system dynamics, i.e., the precise parameters of system degradation models, which is rarely the case in real-world applications. However, cognitive uncertainty may arise regarding the system parameters due to incomplete or imperfect knowledge of a new system, which can significantly affect decision-making processes. To address cognitive uncertainty, future research should focus on robust decision-making methods that can effectively handle imprecise system parameters. This could involve incorporating techniques such as robust optimization and Bayesian methods into MDP frameworks that accommodate cognitive uncertainty, thereby enabling robust decision-making.
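As one concrete instance of the Bayesian route suggested above, imprecise transition probabilities can be assigned a Dirichlet prior and updated from observed transitions, after which candidate models are sampled from the posterior for robust planning. The snippet below sketches this idea with illustrative counts only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Dirichlet prior over the transition probabilities out of one degradation state
prior_counts = np.ones(4)                        # uninformative prior over 4 successor states
observed_transitions = np.array([12, 5, 2, 1])   # illustrative transition counts

posterior = prior_counts + observed_transitions
# Sample plausible transition rows to propagate parameter (cognitive) uncertainty into planning
sampled_rows = rng.dirichlet(posterior, size=5)
print(sampled_rows.round(3))
```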

(4) While model-free RL methods are flexible and can learn directly from interactions with the environment, their performance is often compromised if the simulation models are not accurate. This dependency on the accuracy of simulation models used during training can lead to suboptimal policies that perform poorly in real-world scenarios. To mitigate this, future research should focus on integrating model-based and model-free RL approaches. Combining the strengths of both methods could involve dynamically learning and refining the system model from interactions during the training process, ultimately leading to more accurate and effective RL algorithms for reliability and maintenance optimization. This hybrid framework would leverage the guidance provided by approximate models and the adaptability of model-free methods, offering a promising solution to the challenges faced in reliability and maintenance optimization.

(5) An RL agent learns the mapping between observations and actions through continuous interaction with the environment. It is a typical black-box model that suffers from low interpretability. This poses huge challenges for the implementation of RL-based optimization methods in practice, especially in safety-critical applications. To enhance the interpretability of RL methods, physics-informed RL methods that can integrate physical information and domain knowledge are a potential research direction in the future. In addition, safe RL and constrained RL methods that can guarantee the safety of the policy during both the training and testing stages are also research directions worth exploring in the future.

(6) Existing RL methods typically follow an offline-training and online-execution manner, where the policy learned during the offline stage is directly tested on the practical system in the online phase. However, with the increasing complexity and continuous evolution of engineered systems, classical RL methods developed on the basis of the offline-training-online-execution framework struggle to deal with the continuously generated unmodeled dynamics of the environment. Therefore, one promising future direction is to integrate meta learning and continual learning with RL algorithms to enhance their adaptability to system dynamics. This allows a continuous transformation of online data into valuable knowledge that can improve the adaptability and robustness of the RL agent.

6. Conclusion

This tutorial introduces the application of RL in the context of reliability and maintenance optimization for engineered systems. We provided an overview of the theoretical foundations of RL and detailed the processes involved in MDP formulation, algorithm design, and application, offering a comprehensive guide for employing RL to derive reliability and maintenance optimization policies. Through the case studies on a gas pipeline system's mission abort scenario and a manufacturing system's maintenance optimization, the potential of RL algorithms to enhance decision-making in complex engineering scenarios has been illustrated. The findings of the case studies reveal that while DP algorithms such as value iteration offer rapid solutions in settings with limited state and action spaces, DRL algorithms, such as Dueling DQN and PPO, are more adept at handling the "curse of dimensionality" inherent in more intricate systems. However, despite the promising performance of DRL algorithms, notable challenges

persist, particularly in the design of computationally efficient algorithms capable of navigating the vast decision spaces of reliability and maintenance optimization problems, where the "curse of dimensionality" remains a significant hurdle.

CRediT authorship contribution statement

Qin Zhang: Writing - original draft, Methodology, Investigation, Formal analysis. Yu Liu: Writing - review & editing, Supervision, Resources, Project administration, Funding acquisition, Conceptualization. Yisha Xiang: Writing - review & editing, Investigation, Formal analysis. Tangfan Xiahou: Writing - review & editing.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Data availability

Data will be made available on request.

Acknowledgment

The authors Yu Liu, Qin Zhang, and Tangfan Xiahou greatly acknowledge grant support from the National Natural Science Foundation of China under contract numbers 72271044 and 72331002.

Supplementary materials

Supplementary material associated with this article can be found, in the online version, at doi:10.1016/j.ress.2024.110401.

