
Reinforcement learning in reliability and maintenance optimization: A tutorial

Qin Zhang a, Yu Liu a,b,*, Yisha Xiang c, Tangfan Xiahou a,b

a School of Mechanical and Electrical Engineering, University of Electronic Science and Technology of China, Chengdu, Sichuan 611731, PR China
b Center for System Reliability and Safety, University of Electronic Science and Technology of China, Chengdu, Sichuan 611731, PR China
c Department of Industrial Engineering, University of Houston, Houston, TX 77004, USA

ARTICLE INFO

Keywords:
Markov decision process
Reinforcement learning
Reliability optimization
Maintenance optimization

ABSTRACT

The increasing complexity of engineering systems presents significant challenges in addressing intricate reliability and maintenance optimization problems. Advanced computational techniques have therefore become imperative. Reinforcement learning, with its strong capability of solving complicated sequential decision-making problems under uncertainty, has emerged as a powerful tool for the reliability and maintainability community. This paper offers a step-by-step guideline on the utilization of reinforcement learning algorithms for resolving reliability optimization and maintenance optimization problems. We first introduce the Markov decision process modeling framework for these problems and elucidate the design and implementation of solution algorithms, including dynamic programming, reinforcement learning, and deep reinforcement learning. Case studies, including a pipeline system mission abort optimization and a manufacturing system condition-based maintenance decision-making, are included to demonstrate the utility of reinforcement learning in reliability and maintenance applications. This tutorial will assist researchers in the reliability and maintainability community by summarizing the state-of-the-art reinforcement learning algorithms and providing hands-on implementations for reliability and maintenance optimization problems.

1. Introduction

Engineered systems inevitably experience degradation, leading to unexpected shutdowns and significant adverse consequences [1,2]. Reliability and maintenance optimization, which aims to retain and improve system reliability through intervention efforts, including maintenance, mission abort, and system reconfiguration, is of paramount concern throughout the operation of any system [3-5]. As a system gradually degrades over time, reliability and maintenance optimization is typically a sequential decision-making process, where the optimal decisions have to be made in response to dynamic changes in the system's state [6,7]. The Markov decision process (MDP) provides an analytic framework for reliability and maintenance optimization, in which the system state degrades over operation and the optimal action for each state is identified [8]. However, solving MDPs typically faces scalability and dimensionality issues. The computational burden associated with traditional algorithms (e.g., dynamic programming) increases significantly as the problem size grows, thus limiting their applicability to practical engineering instances.

Reinforcement learning (RL), a critical paradigm in the field of machine learning, has emerged as a powerful and effective tool for solving MDPs. RL is an efficient framework that focuses on sequential decision-making problems, determining the optimal actions that maximize cumulative rewards [9]. The concept of RL can be traced back to early works in behavioral psychology and neuroscience regarding how animals learn behaviors through interactions with the environment, inspired by Pavlovian and instrumental conditioning [10]. RL utilizes a trial-and-error learning paradigm, gaining knowledge through interactions with the environment and shaping behavior with rewards and punishments. Its powerful mechanism for learning optimal actions from interaction experiences enables it to operate in a model-free manner and eliminates the need for pre-labeled data. This attribute confers a high degree of adaptability and flexibility across various complicated engineering scenarios. The RL paradigm also reflects advances in computational capabilities and algorithmic strategies that have overcome limitations related to scalability and dimensionality in MDPs. Additionally, advances in deep learning have further enhanced RL, culminating in the transformative paradigm known as deep reinforcement learning (DRL).
DRL leverages the efficient function approximation and generalization capabilities of deep neural networks to handle high-dimensional and continuous state spaces effectively. A further advantage of integrating deep neural networks is the capability to process diverse types of input data, including images, text, and audio, thereby extending DRL's applicability to a broader spectrum of engineering scenarios. The strong capabilities of RL and DRL in addressing complicated and large-scale problems have rendered them among the most prevalent tools in the reliability and maintainability community [11]. In this article, we aim to offer a step-by-step tutorial on RL for reliability and maintenance optimization. Starting with the MDP formulation of reliability and maintenance optimization problems, this tutorial elucidates the design and implementation of RL algorithms and provides pointers to several case studies in engineering applications.
The rest of this article is organized as follows. Section 2 provides an overview of RL and DRL. Section 3 presents the application of RL in reliability optimization, and Section 4 introduces its application in maintenance optimization. Section 5 concludes this tutorial.

2. Reinforcement learning and deep reinforcement learning

2.1. Markov decision process

The MDP is a classical framework for sequential decision-making problems; it is defined on sequences of states and actions and driven by Markov chains. In the sequential decision process, an action is selected in each state, and this action impacts not only the immediate reward but also subsequent states and future rewards [12]. The goal of an MDP is to develop a policy that selects the optimal action in each state so as to maximize the expected accumulated rewards over time. MDPs have widespread applications in engineering scenarios and can be traced back to the pioneering work of Richard Bellman [13]. An MDP can be described by a tuple of five elements $(\mathscr{S}, \mathscr{A}, P, R, \gamma)$, where $\mathscr{S}$ is the state space containing all the states, $\mathscr{A}$ is the action space with a set of feasible actions, $P: \mathscr{S} \times \mathscr{A} \rightarrow [0,1]$ is the transition probability matrix which characterizes the dynamics of the environment, $R$ is the reward function, and $\gamma \in [0,1]$ is the discount factor that discounts future rewards and thus balances immediate against long-term rewards. Note that the state of a system typically represents the specific condition of the environment or system being modeled, and it encapsulates all the relevant information needed to drive the decisions.
At each decision step $t$, a decision maker selects an action $a_t$ for the current state $s_t$; the system then transitions to a new state $s_{t+1}$ according to the transition probability $P(s_{t+1} \mid s_t, a_t)$ and obtains reward $r_t$. A solution to an MDP is provided in terms of a policy $\pi$, which maps each state to an action to take in that state. A deterministic policy maps states to actions, $\pi: \mathscr{S} \rightarrow \mathscr{A}$, prescribing which action to take in each state. A stochastic policy prescribes the probability of each action in each state, $\pi: \mathscr{S} \times \mathscr{A} \rightarrow [0,1]$, where $\pi(a_t \mid s_t)$ is the probability of selecting action $a_t$ in state $s_t$. The goal of an MDP is to obtain an optimal policy that maximizes the expected accumulated rewards. MDPs can be defined over a finite or an infinite horizon. The former is characterized by its episodic nature, encompassing systems with clearly delineated terminal states and thus confining the decision-making process to a predefined temporal boundary. Conversely, the latter, lacking explicit terminal states, represents systems where decision-making extends indefinitely, thereby accommodating continuous adaptation to the evolving process.
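To make the MDP tuple $(\mathscr{S}, \mathscr{A}, P, R, \gamma)$ concrete, the following minimal sketch encodes a small, purely hypothetical MDP as NumPy arrays; the three degradation states, two actions, and all numerical values are illustrative assumptions rather than part of any case study in this tutorial.

```python
# A minimal, hypothetical MDP (S, A, P, R, gamma) encoded as NumPy arrays.
# All states, actions, and numbers below are illustrative assumptions.
import numpy as np

n_states, n_actions = 3, 2   # |S| degradation levels, |A| candidate actions
gamma = 0.95                 # discount factor

# P[a, s, s'] = P(s' | s, a): one row-stochastic matrix per action.
P = np.array([
    [[0.9, 0.1, 0.0],        # action 0: do nothing (system keeps degrading)
     [0.0, 0.8, 0.2],
     [0.0, 0.0, 1.0]],
    [[1.0, 0.0, 0.0],        # action 1: maintain (system restored to state 0)
     [1.0, 0.0, 0.0],
     [1.0, 0.0, 0.0]],
])

# R[s, a] = r(s, a): immediate reward (negative cost) of taking action a in state s.
R = np.array([
    [  0.0, -5.0],
    [ -1.0, -5.0],
    [-20.0, -5.0],
])
```
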
The value function of state $s$ under policy $\pi$ is the expected accumulated rewards when starting in $s$ and following $\pi$, and it is represented as:

$$v_{\pi}(s)=\mathbb{E}\left[\sum_{k=0}^{\infty} \gamma^{k} r_{t+k+1} \mid s_{t}=s\right], \quad \forall s \in \mathscr{S}.$$
Similar to the value function, the action-value function $q_{\pi}(s, a)$ defines the expected accumulated rewards of taking action $a$ in state $s$ and following policy $\pi$ onwards, and it is formulated as:

$$q_{\pi}(s, a)=\mathbb{E}\left[\sum_{k=0}^{\infty} \gamma^{k} r_{t+k+1} \mid s_{t}=s, a_{t}=a\right], \quad \forall s \in \mathscr{S}, \forall a \in \mathscr{A}.$$
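As a quick numerical illustration of the accumulated reward inside these expectations, the short sketch below evaluates the discounted return $\sum_{k=0}^{K-1} \gamma^{k} r_{t+k+1}$ for one hypothetical, finite reward trajectory; the reward values are arbitrary.

```python
# Discounted return of a single, hypothetical reward trajectory.
def discounted_return(rewards, gamma):
    # sum_{k=0}^{K-1} gamma^k * r_{t+k+1}
    return sum(gamma ** k * r for k, r in enumerate(rewards))

print(discounted_return([0.0, -1.0, -1.0, -20.0], gamma=0.95))  # approximately -19.0
```
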
The value function and action-value function can be formulated as the following recursive forms:

$$v_{\pi}(s)=\sum_{a \in \mathscr{A}} \pi(a \mid s)\left[r(s, a)+\gamma \sum_{s^{\prime} \in \mathscr{S}} P\left(s^{\prime} \mid s, a\right) v_{\pi}\left(s^{\prime}\right)\right], \quad \forall s \in \mathscr{S},$$
$$q_{\pi}(s, a)=r(s, a)+\gamma \sum_{s^{\prime} \in \mathscr{S}} P\left(s^{\prime} \mid s, a\right) v_{\pi}\left(s^{\prime}\right), \quad \forall s \in \mathscr{S}, \forall a \in \mathscr{A}.$$
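For a finite MDP, the first recursive form above is a linear system in $v_{\pi}$, so a fixed stochastic policy can be evaluated exactly and $q_{\pi}$ then recovered from $v_{\pi}$. The sketch below is one minimal way to do this, assuming the hypothetical array conventions of the earlier sketch (P[a, s, s'], R[s, a], gamma) and a policy array with pi[s, a] = $\pi(a \mid s)$.

```python
# Exact evaluation of a fixed stochastic policy on a finite MDP: solve the linear
# system v_pi = r_pi + gamma * P_pi v_pi, then recover q_pi from v_pi.
# P[a, s, s'], R[s, a], and gamma follow the earlier sketch; pi[s, a] = pi(a | s).
import numpy as np

def evaluate_policy(P, R, gamma, pi):
    n_states = R.shape[0]
    r_pi = np.einsum("sa,sa->s", pi, R)        # r_pi(s)    = sum_a pi(a|s) r(s,a)
    P_pi = np.einsum("sa,ast->st", pi, P)      # P_pi(s,s') = sum_a pi(a|s) P(s'|s,a)
    v_pi = np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)
    q_pi = R + gamma * np.einsum("ast,t->sa", P, v_pi)   # q_pi(s,a) from v_pi
    return v_pi, q_pi
```
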
The optimal solution to an MDP refers to the policy $\pi^{*}$ that generates the maximal expected accumulated reward over the decision horizon. The optimal value function, denoted as $v^{*}(s)$, gives the expected accumulated reward of starting in state $s$ and following the optimal policy $\pi^{*}$, and it can be estimated by solving the well-known Bellman optimality equation [11]:
$$v^{*}(s)=\max_{a}\left\{r(s, a)+\gamma \sum_{s^{\prime} \in \mathscr{S}} P\left(s^{\prime} \mid s, a\right) v^{*}\left(s^{\prime}\right)\right\}, \quad \forall s \in \mathscr{S}.$$
Similar to the optimal value function, the Bellman optimality equation of the optimal action-value function $q^{*}(s, a)$ can be written as:

$$q^{*}(s, a)=r(s, a)+\gamma \sum_{s^{\prime} \in \mathscr{S}} P\left(s^{\prime} \mid s, a\right) \max_{a^{\prime}}\left\{q^{*}\left(s^{\prime}, a^{\prime}\right)\right\}, \quad \forall s \in \mathscr{S}, \forall a \in \mathscr{A}.$$
Based on the optimal value function $v^{*}(s)$ and action-value function $q^{*}(s, a)$, the optimal policy $\pi^{*}$ can be directly estimated by:

$$\pi^{*}(s)=\underset{a}{\operatorname{argmax}}\left\{r(s, a)+\gamma \sum_{s^{\prime} \in \mathscr{S}} P\left(s^{\prime} \mid s, a\right) v^{*}\left(s^{\prime}\right)\right\} \quad \text{or} \quad \pi^{*}(s)=\underset{a}{\operatorname{argmax}}\, q^{*}(s, a).$$
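Given either $v^{*}$ or $q^{*}$, this extraction step is a plain argmax over actions. A minimal sketch, again under the hypothetical array conventions used above:

```python
# Greedy policy extraction from v* or q*, mirroring the two argmax forms above.
# P[a, s, s'], R[s, a], and gamma follow the earlier sketch.
import numpy as np

def policy_from_v_star(P, R, gamma, v_star):
    # pi*(s) = argmax_a { r(s, a) + gamma * sum_{s'} P(s'|s, a) v*(s') }
    q_star = R + gamma * np.einsum("ast,t->sa", P, v_star)
    return np.argmax(q_star, axis=1)

def policy_from_q_star(q_star):
    # pi*(s) = argmax_a q*(s, a)
    return np.argmax(q_star, axis=1)
```
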
Dynamic programming (DP) provides an exact solution framework for MDPs, based on the premise that a perfect model of the environment is available. The DP methods leverage the Bellman optimality equation, decomposing the value function into immediate rewards and discounted future values, thus facilitating a recursive formulation for estimating the optimal policy $\pi^{*}(s)$. The two most popular methods in DP are policy iteration and value iteration. Policy iteration searches for the optimal policy by two iterative steps: policy evaluation and policy improvement. Policy evaluation iteratively evaluates the value function for all states under policy $\pi$ using the Bellman equation:

$$v_{k+1}(s) \leftarrow \sum_{a \in \mathscr{A}} \pi(a \mid s)\left[r(s, a)+\gamma \sum_{s^{\prime} \in \mathscr{S}} P\left(s^{\prime} \mid s, a\right) v_{k}\left(s^{\prime}\right)\right], \quad \text{for all } s \in \mathscr{S},$$

followed by a policy improvement step where the policy is updated by choosing actions that maximize the expected utility:

$$\pi(s) \leftarrow \underset{a}{\operatorname{argmax}}\left\{r(s, a)+\gamma \sum_{s^{\prime} \in \mathscr{S}} P\left(s^{\prime} \mid s, a\right) v_{\pi}\left(s^{\prime}\right)\right\}.$$
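Putting the two steps together, a minimal policy-iteration sketch under the same hypothetical array conventions might look as follows; the inner sweep implements the iterative policy evaluation update and the outer step the greedy improvement.

```python
# A minimal policy-iteration sketch: iterative policy evaluation of the current
# deterministic policy, followed by greedy policy improvement, repeated until
# the policy is stable. P[a, s, s'], R[s, a], and gamma follow the earlier sketch.
import numpy as np

def policy_iteration(P, R, gamma, eval_tol=1e-8, max_sweeps=10_000):
    n_states = R.shape[0]
    policy = np.zeros(n_states, dtype=int)           # arbitrary initial policy
    v = np.zeros(n_states)
    while True:
        # Policy evaluation: sweep the Bellman expectation update until convergence.
        for _ in range(max_sweeps):
            q = R + gamma * np.einsum("ast,t->sa", P, v)
            v_new = q[np.arange(n_states), policy]   # value of following the current policy
            delta = np.max(np.abs(v_new - v))
            v = v_new
            if delta < eval_tol:
                break
        # Policy improvement: act greedily with respect to the evaluated values.
        q = R + gamma * np.einsum("ast,t->sa", P, v)
        new_policy = np.argmax(q, axis=1)
        if np.array_equal(new_policy, policy):       # stable policy -> optimal
            return policy, v
        policy = new_policy
```
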
Value iteration directly estimates the value function for each state using the Bellman optimality equation until it converges to the optimal value function. After identifying the optimal value function, the optimal policy can be trivially extracted by traversing the states and identifying the actions corresponding to the highest values. It can be written as a simple update that combines the policy improvement and truncated policy evaluation steps:

$$v_{k+1}(s) \leftarrow \max_{a \in \mathscr{A}}\left[r(s, a)+\gamma \sum_{s^{\prime} \in \mathscr{S}} P\left(s^{\prime} \mid s, a\right) v_{k}\left(s^{\prime}\right)\right], \quad \text{for all } s \in \mathscr{S}.$$
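A corresponding value-iteration sketch, under the same hypothetical array conventions, repeatedly applies this update and then reads the greedy policy off the converged values. On the toy arrays of the first sketch, for example, value_iteration(P, R, gamma) would return one greedy action per degradation state together with the converged value function.

```python
# A minimal value-iteration sketch: repeatedly apply the Bellman optimality
# update, then extract the greedy policy from the converged value function.
# P[a, s, s'], R[s, a], and gamma follow the earlier sketch.
import numpy as np

def value_iteration(P, R, gamma, tol=1e-8, max_sweeps=10_000):
    n_states = R.shape[0]
    v = np.zeros(n_states)
    for _ in range(max_sweeps):
        q = R + gamma * np.einsum("ast,t->sa", P, v)   # q(s, a) under the current v
        v_new = q.max(axis=1)                          # Bellman optimality backup
        converged = np.max(np.abs(v_new - v)) < tol
        v = v_new
        if converged:
            break
    policy = np.argmax(q, axis=1)                      # greedy (near-)optimal policy
    return policy, v
```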