This paper is concerned with a new discrete-time policy iteration adaptive dynamic programming (ADP) method for solving the infinite horizon optimal control problem of nonlinear systems. The idea is to use an iterative ADP technique to obtain the iterative control law, which optimizes the iterative performance index function. The main contribution of this paper is to analyze, for the first time, the convergence and stability properties of the policy iteration method for discrete-time nonlinear systems. It is shown that the iterative performance index function converges nonincreasingly to the optimal solution of the Hamilton-Jacobi-Bellman equation. It is also proven that any of the iterative control laws can stabilize the nonlinear system. Neural networks are used to approximate the performance index function and compute the optimal control law, respectively, to facilitate the implementation of the iterative ADP algorithm, and the convergence of the weight matrices is analyzed. Finally, numerical results and analysis are presented to illustrate the performance of the developed method.
DYNAMIC programming is a very useful tool for solving optimal control problems. However, due to the curse of dimensionality [1], it is often computationally untenable to run dynamic programming to obtain the optimal solution. Adaptive/approximate dynamic programming (ADP) algorithms were proposed in [2] and [3] as a way to solve optimal control problems forward in time and have gained much attention from researchers [4]-[12]. There are several synonyms used for ADP, including adaptive critic designs [13]-[15], adaptive dynamic programming [16]-[19], approximate dynamic programming [20], [21], neural dynamic programming [22], neurodynamic programming [23], and reinforcement learning [24]-[26]. In [14] and [21], ADP approaches were classified into several main schemes, which include heuristic dynamic programming (HDP), action-dependent HDP, also known as Q-learning [27], dual heuristic dynamic programming (DHP), action-dependent DHP, globalized DHP (GDHP), and action-dependent GDHP. Iterative methods are also used in ADP to obtain the solution of the Hamilton-Jacobi-Bellman (HJB) equation indirectly and have received more and more attention [28]-[35]. There are two main iterative ADP algorithms, namely the value iteration ADP (value iteration for brief) algorithm and the policy iteration ADP (policy iteration for brief) algorithm [36], [37].
The value iteration algorithm for optimal control of discrete-time nonlinear systems was given in [38]. In 2008, Al-Tamimi et al. [39] studied the value iteration algorithm for deterministic discrete-time affine nonlinear systems, which was referred to as HDP and was proposed for finding the optimal control law. Starting from a zero initial performance index function, it is proven in [39] that the iterative performance index function is a nondecreasing and bounded sequence. When the iteration index increases to infinity, the iterative performance index function converges to the optimal performance index function, which satisfies the HJB equation. In 2008, Zhang et al. [40] applied value iteration ADP to solve optimal tracking control problems for nonlinear systems. Liu et al. [16] realized value iteration ADP by GDHP. For the value iteration algorithm of ADP, the initial performance index function is required to be zero [16], [18], [39]-[42]. On the other hand, for the value iteration algorithm, the stability of the system under the iterative control law cannot be guaranteed. This means that only the converged optimal control law can be used to control the nonlinear system, and all the iterative controls obtained during the iteration procedure may be invalid. Therefore, the computational efficiency of the value iteration algorithm of ADP is very low. Till now, all the value iteration algorithms have been implemented offline, which limits their practical applications.
The policy iteration algorithm for continuous-time systems was proposed in [17]. Murray et al. [17] proved that for continuous-time affine nonlinear systems, the iterative performance index function converges to the optimum nonincreasingly and each of the iterative control laws stabilizes the nonlinear system under the policy iteration algorithm. This is a great merit of the policy iteration algorithm, and it has hence found many applications in solving optimal control problems of nonlinear systems. In 2005, Abu-Khalaf and Lewis [43] proposed a policy iteration algorithm for continuous-time nonlinear systems with control constraints. Zhang et al. [44] applied the policy iteration algorithm to solve continuous-time nonlinear two-person zero-sum games.
Vamvoudakis et al. [45] proposed multiagent differential graphical games for continuous-time linear systems using policy iteration algorithms. Bhasin et al. [46] proposed an online actor-critic-identifier architecture to obtain the optimal control law for uncertain nonlinear systems by policy iteration algorithms. Till now, nearly all the online iterative ADP algorithms have been policy iteration algorithms. However, almost all the discussions of policy iteration algorithms are based on continuous-time control systems [44]-[47]. Discussions of policy iteration algorithms for discrete-time control systems are scarce. Only in [36] and [37] was a brief sketch of the policy iteration algorithm for discrete-time systems presented, whereas the stability and convergence properties were not discussed. To the best of our knowledge, there are still no discussions focused on policy iteration algorithms for discrete-time systems, which motivates our research.
In this paper, the properties of the policy iteration algorithm of ADP for discrete-time nonlinear systems are analyzed for the first time. First, the policy iteration algorithm is introduced to find the optimal control law. Second, it is shown that any of the iterative control laws can stabilize the nonlinear system. Third, it is shown that the iterative control laws obtained by the policy iteration algorithm make the iterative performance index functions converge to the optimal solution monotonically. Next, an effective method is developed to obtain the initial admissible control law by repeating experiments. Furthermore, to facilitate the implementation of the policy iteration algorithm, it is shown how to use neural networks to implement the developed policy iteration algorithm and obtain the iterative performance index functions and the iterative control laws. The weight convergence properties of the neural networks are also established. In the numerical examples, comparisons are displayed to show the effectiveness of the developed algorithm. For linear systems, the control results of the policy iteration algorithm are compared with those of the algebraic Riccati equation (ARE) method to justify the correctness of the developed method. For nonlinear systems, the control results of the policy iteration algorithm are compared with those of the value iteration algorithm to show the effectiveness of the developed algorithm.
The rest of this paper is organized as follows. In Section II, the problem formulation is presented. In Section III, the policy iteration algorithm for discrete-time nonlinear systems is derived. The stability, convergence, and optimality properties are also proven in this section. In Section IV, the neural network implementation of the optimal control scheme is discussed. In Section V, numerical results and analyses are presented to demonstrate the effectiveness of the developed optimal control scheme. Finally, in Section VI, conclusions are drawn and our future work is outlined.
II. Problem Formulation
In this paper, we will study the following deterministic discrete-time systems:
$$x_{k+1}=F\left(x_{k}, u_{k}\right), \quad k=0,1,2, \ldots \tag{1}$$
where $x_{k} \in \mathbb{R}^{n}$ is the $n$-dimensional state vector and $u_{k} \in \mathbb{R}^{m}$ is the $m$-dimensional control vector. Let $x_{0}$ be the initial state and $F\left(x_{k}, u_{k}\right)$ be the system function.
Let $\underline{u}_{k}=\left\{u_{k}, u_{k+1}, \ldots\right\}$ be an arbitrary sequence of controls from $k$ to $\infty$. The performance index function for state $x_{0}$ under the control sequence $\underline{u}_{0}=\left\{u_{0}, u_{1}, \ldots\right\}$ is defined as
$$J\left(x_{0}, \underline{u}_{0}\right)=\sum_{k=0}^{\infty} U\left(x_{k}, u_{k}\right) \tag{2}$$
where $U\left(x_{k}, u_{k}\right)>0$, for $\forall x_{k}, u_{k} \neq 0$, is the utility function.
We will study optimal control problems for (1). The goal of this paper is to find an optimal control scheme that stabilizes (1) and simultaneously minimizes the performance index function (2). For convenience of analysis, the results of this paper are based on the following assumption.
Assumption 2.1: System (1) is controllable; the system state $x_{k}=0$ is an equilibrium state of (1) under the control $u_{k}=0$, i.e., $F(0,0)=0$; the feedback control $u_{k}=u\left(x_{k}\right)$ satisfies $u_{k}=u\left(x_{k}\right)=0$ for $x_{k}=0$; the utility function $U\left(x_{k}, u_{k}\right)$ is a positive definite function for any $x_{k}$ and $u_{k}$.
As (1) is controllable, there exists a stable control sequence $\underline{u}_{k}=\left\{u_{k}, u_{k+1}, \ldots\right\}$ that moves $x_{k}$ to zero. Let $\underline{\mathfrak{A}}_{k}$ denote the set of all stable control sequences. Then, the optimal performance index function can be defined as
$$J^{*}\left(x_{k}\right)=\inf \left\{J\left(x_{k}, \underline{u}_{k}\right): \underline{u}_{k} \in \underline{\mathfrak{A}}_{k}\right\}. \tag{3}$$
According to Bellman's principle of optimality, $J^{*}\left(x_{k}\right)$ satisfies the following discrete-time HJB equation:
$$J^{*}\left(x_{k}\right)=\inf _{u_{k}}\left\{U\left(x_{k}, u_{k}\right)+J^{*}\left(F\left(x_{k}, u_{k}\right)\right)\right\}. \tag{4}$$
Define the optimal control law as
$$u^{*}\left(x_{k}\right)=\arg \inf _{u_{k}}\left\{U\left(x_{k}, u_{k}\right)+J^{*}\left(F\left(x_{k}, u_{k}\right)\right)\right\}. \tag{5}$$
Hence, the HJB equation (4) can be written as
$$J^{*}\left(x_{k}\right)=U\left(x_{k}, u^{*}\left(x_{k}\right)\right)+J^{*}\left(F\left(x_{k}, u^{*}\left(x_{k}\right)\right)\right). \tag{6}$$
We can see that if we want to obtain the optimal control law $u^{*}\left(x_{k}\right)$, we must obtain the optimal performance index function $J^{*}\left(x_{k}\right)$. Generally, $J^{*}\left(x_{k}\right)$ is unknown before all the controls $u_{k} \in \mathbb{R}^{m}$ are considered. If we adopt the traditional dynamic programming method to obtain the optimal performance index function at every time step, then we have to face the curse of dimensionality. Furthermore, the optimal control problem is considered over an infinite horizon. This means that the length of the control sequence is infinite, which makes the optimal control nearly impossible to obtain from the HJB equation (6). To overcome this difficulty, a new iterative algorithm based on ADP is developed.
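As a simple illustration of the difficulty, consider a hypothetical scalar linear-quadratic special case, $x_{k+1}=a x_{k}+b u_{k}$ with $U\left(x_{k}, u_{k}\right)=q x_{k}^{2}+r u_{k}^{2}$ (this example is not part of the development above and is included only as a sketch). Guessing $J^{*}(x)=p x^{2}$ with $p>0$ and substituting into (6) gives
$$u^{*}(x)=-\frac{a b p}{r+b^{2} p}\, x, \qquad p=q+a^{2} p-\frac{a^{2} b^{2} p^{2}}{r+b^{2} p}$$
i.e., the scalar discrete-time algebraic Riccati equation, which can be solved for $p$ directly. For a general nonlinear $F$ and utility $U$, no such closed-form structure of $J^{*}$ is available, which is exactly why the iterative ADP algorithm below is needed.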
III. Policy Iteration Algorithm
In this section, the discrete-time policy iteration algorithm of ADP is developed to obtain the optimal controller for nonlinear systems. The goal of the developed policy iteration algorithm is to construct an iterative control law $v_{i}\left(x_{k}\right)$ that moves an arbitrary initial state $x_{0}$ to the equilibrium 0 and simultaneously makes the iterative performance index function reach the optimum. Stability proofs will be given to show that any of the iterative control laws can stabilize the nonlinear system. Convergence and optimality proofs will also be given to show that the performance index functions converge to the optimum.
A. Derivation of the Policy Iteration Algorithm
For optimal control problems, the developed control scheme must not only stabilize the control system but also make the performance index function finite, i.e., it must be an admissible control law [39].
Definition 3.1: A control law $u\left(x_{k}\right)$ is defined to be admissible with respect to (2) on $\Omega_{1}$ if $u\left(x_{k}\right)$ is continuous on $\Omega_{1}$, $u(0)=0$, $u\left(x_{k}\right)$ stabilizes (1) on $\Omega_{1}$, and $\forall x_{0} \in \Omega_{1}$, $J\left(x_{0}\right)$ is finite.
In the developed policy iteration algorithm, the performance index function and control law are updated by iterations, with the iteration index $i$ increasing from zero to infinity. Let $v_{0}\left(x_{k}\right)$ be an arbitrary admissible control law. For $i=0$, let $V_{0}\left(x_{k}\right)$ be the iterative performance index function constructed by $v_{0}\left(x_{k}\right)$, which satisfies the following generalized HJB (GHJB) equation:
$$V_{0}\left(x_{k}\right)=U\left(x_{k}, v_{0}\left(x_{k}\right)\right)+V_{0}\left(x_{k+1}\right) \tag{7}$$
where $x_{k+1}=F\left(x_{k}, v_{0}\left(x_{k}\right)\right)$. Then, the iterative control law is computed by
$$v_{1}\left(x_{k}\right)=\arg \min _{u_{k}}\left\{U\left(x_{k}, u_{k}\right)+V_{0}\left(F\left(x_{k}, u_{k}\right)\right)\right\}. \tag{8}$$
For $\forall i=1,2, \ldots$, let $V_{i}\left(x_{k}\right)$ be the iterative performance index function constructed by $v_{i}\left(x_{k}\right)$, which satisfies the following GHJB equation:
$$V_{i}\left(x_{k}\right)=U\left(x_{k}, v_{i}\left(x_{k}\right)\right)+V_{i}\left(F\left(x_{k}, v_{i}\left(x_{k}\right)\right)\right) \tag{9}$$
and the iterative control law is updated by
$$v_{i+1}\left(x_{k}\right)=\arg \min _{u_{k}}\left\{U\left(x_{k}, u_{k}\right)+V_{i}\left(F\left(x_{k}, u_{k}\right)\right)\right\}. \tag{10}$$
From the policy iteration algorithm (7)-(10), we can see that the iterative performance index function $V_{i}\left(x_{k}\right)$ is used to approximate $J^{*}\left(x_{k}\right)$ and the iterative control law $v_{i}\left(x_{k}\right)$ is used to approximate $u^{*}\left(x_{k}\right)$. As (9) is generally not the HJB equation, we have $V_{i}\left(x_{k}\right) \neq J^{*}\left(x_{k}\right)$ and $v_{i}\left(x_{k}\right) \neq u^{*}\left(x_{k}\right)$ for $\forall i=0,1, \ldots$ Therefore, it is necessary to determine whether the algorithm is convergent, which means that $V_{i}\left(x_{k}\right)$ and $v_{i}\left(x_{k}\right)$ converge to the optimal ones as $i \rightarrow \infty$. In the following section, we will show the properties of the policy iteration algorithm.
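The following is a minimal tabular sketch of the iteration (7)-(10) in Python on a discretized scalar example. The system, utility, grids, and initial admissible law are illustrative assumptions and are not taken from the paper; policy evaluation solves the GHJB equation by fixed-point sweeps, and policy improvement performs the minimization in (10) over a control grid.

```python
# Tabular policy iteration sketch for (7)-(10); all numerical choices are assumed.
import numpy as np

xs = np.linspace(-2.0, 2.0, 201)            # state grid
us = np.linspace(-2.0, 2.0, 201)            # control grid

def F(x, u):                                # assumed system function
    return 0.8 * np.sin(x) + u

def U(x, u):                                # assumed utility function
    return x**2 + u**2

v = -0.8 * np.sin(xs)                       # initial admissible control law v_0 (deadbeat)
V = np.zeros_like(xs)                       # iterative performance index on the grid

for i in range(20):                         # policy iteration index i
    # Policy evaluation: solve the GHJB equation (9) for V_i by fixed-point sweeps.
    for _ in range(5000):
        V_new = U(xs, v) + np.interp(F(xs, v), xs, V)
        done = np.max(np.abs(V_new - V)) < 1e-8
        V = V_new
        if done:
            break
    # Policy improvement: v_{i+1}(x) = argmin_u { U(x, u) + V_i(F(x, u)) }, cf. (10).
    Q = U(xs[:, None], us[None, :]) + \
        np.interp(F(xs[:, None], us[None, :]).ravel(), xs, V).reshape(len(xs), len(us))
    v = us[np.argmin(Q, axis=1)]

print("approximate J*(1.0):", np.interp(1.0, xs, V))
print("approximate u*(1.0):", np.interp(1.0, xs, v))
```

Here `np.interp` provides a simple piecewise-linear approximation of $V_i$ between grid points; the neural-network implementation of Section IV plays this approximation role in the developed algorithm.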
B. Properties of the Policy Iteration Algorithm
For the policy iteration algorithm for continuous-time nonlinear systems [17], it has been shown that any of the iterative control laws can stabilize the system. This is a merit of the policy iteration algorithm. For discrete-time systems, the stability of the system can also be guaranteed under the iterative control law.
Lemma 3.1: For $i=0,1, \ldots$, let $V_{i}\left(x_{k}\right)$ and $v_{i}\left(x_{k}\right)$ be obtained by the policy iteration algorithm (7)-(10), where $v_{0}\left(x_{k}\right)$ is an arbitrary admissible control law. If Assumption 2.1 holds, then for $\forall i=0,1, \ldots$, the iterative control law $v_{i}\left(x_{k}\right)$ stabilizes the nonlinear system (1).
For continuous-time policy iteration algorithms [17], by analyzing the derivative of the difference of the iterative performance index functions with respect to the time variable $t$, it is proven that the iterative performance index functions converge nonincreasingly to the optimal solution of the HJB equation. For discrete-time systems, the continuous-time method is not applicable. Thus, a new convergence proof is established in this paper. We will show that, using the policy iteration algorithm for discrete-time nonlinear systems, the iterative performance index functions also converge nonincreasingly to the optimum.
Theorem 3.1: For $i=0,1, \ldots$, let $V_{i}\left(x_{k}\right)$ and $v_{i}\left(x_{k}\right)$ be obtained by (7)-(10). If Assumption 2.1 holds, then for $\forall x_{k} \in \mathbb{R}^{n}$, the iterative performance index function $V_{i}\left(x_{k}\right)$ is a monotonically nonincreasing sequence for $\forall i \geq 0$, i.e.,
$$V_{i+1}\left(x_{k}\right) \leq V_{i}\left(x_{k}\right). \tag{11}$$
Proof: First, for $i=0,1, \ldots$, define a new iterative performance index function $\Gamma_{i+1}\left(x_{k}\right)$ as
$$\Gamma_{i+1}\left(x_{k}\right)=U\left(x_{k}, v_{i+1}\left(x_{k}\right)\right)+V_{i}\left(F\left(x_{k}, v_{i+1}\left(x_{k}\right)\right)\right) \tag{12}$$
where $v_{i+1}\left(x_{k}\right)$ is obtained by (10). According to (9), (10), and (12), for $\forall x_{k}$, we can obtain
$$\Gamma_{i+1}\left(x_{k}\right)=\min _{u_{k}}\left\{U\left(x_{k}, u_{k}\right)+V_{i}\left(F\left(x_{k}, u_{k}\right)\right)\right\} \leq U\left(x_{k}, v_{i}\left(x_{k}\right)\right)+V_{i}\left(F\left(x_{k}, v_{i}\left(x_{k}\right)\right)\right)=V_{i}\left(x_{k}\right).$$
Inequality (11) will be proven by mathematical induction. According to Lemma 3.1, for $i=0,1, \ldots$, $v_{i+1}\left(x_{k}\right)$ is a stable control law. Then, we have $x_{k} \rightarrow 0$ as $k \rightarrow \infty$. Without loss of generality, let $x_{N}=0$, where $N \rightarrow \infty$. We can obtain
Therefore, the conclusion holds for $k=N-1$. Assume that the conclusion holds for $k=l+1$, $l=0,1, \ldots$. For $k=l$, we can obtain
According to (17) and (18), we can obtain that for $i=0,1, \ldots$, (11) holds for $\forall x_{k}$. The proof is completed.
Corollary 3.1: For $i=0,1, \ldots$, let $V_{i}\left(x_{k}\right)$ and $v_{i}\left(x_{k}\right)$ be obtained by (7)-(10), where $v_{0}\left(x_{k}\right)$ is an arbitrary admissible control law. If Assumption 2.1 holds, then for $\forall i=0,1, \ldots$, the iterative control law $v_{i}\left(x_{k}\right)$ is admissible.
From Theorem 3.1, we know that the iterative performance index function $V_{i}\left(x_{k}\right) \geq 0$ is monotonically nonincreasing and bounded below for the iteration index $i=1,2, \ldots$. Hence, we can derive the following theorem.
Theorem 3.2: For $i=0,1, \ldots$, let $V_{i}\left(x_{k}\right)$ and $v_{i}\left(x_{k}\right)$ be obtained by (7)-(10). If Assumption 2.1 holds, then the iterative performance index function $V_{i}\left(x_{k}\right)$ converges to the optimal performance index function $J^{*}\left(x_{k}\right)$ as $i \rightarrow \infty$, i.e.,
$$\lim _{i \rightarrow \infty} V_{i}\left(x_{k}\right)=J^{*}\left(x_{k}\right) \tag{19}$$
which satisfies the HJB equation (4).
Proof: The statement can be proven by the following three steps.
1) Show that the limit of the iterative performance index function $V_{i}\left(x_{k}\right)$ satisfies the HJB equation as $i \rightarrow \infty$.
According to Theorem 3.1, since $V_{i}\left(x_{k}\right)$ is a nonincreasing and lower-bounded sequence, the limit of the iterative performance index function $V_{i}\left(x_{k}\right)$ exists as $i \rightarrow \infty$. Define the performance index function $V_{\infty}\left(x_{k}\right)$ as the limit of the iterative function $V_{i}\left(x_{k}\right)$, i.e., $V_{\infty}\left(x_{k}\right)=\lim _{i \rightarrow \infty} V_{i}\left(x_{k}\right)$.
According to the definition of $\Gamma_{i+1}\left(x_{k}\right)$ in (12), we have
Let $\varepsilon>0$ be an arbitrary positive number. Since $V_{i}\left(x_{k}\right)$ is nonincreasing for $i \geq 1$ and $\lim _{i \rightarrow \infty} V_{i}\left(x_{k}\right)=V_{\infty}\left(x_{k}\right)$, there exists a positive integer $p$ such that
Next, let $\mu\left(x_{k}\right)$ be an arbitrary admissible control law, and define a new performance index function $P\left(x_{k}\right)$, which satisfies
Then, we can proceed to the second step of the proof.
2) Show that for an arbitrary admissible control law $\mu\left(x_{k}\right)$, the performance index function $V_{\infty}\left(x_{k}\right) \leq P\left(x_{k}\right)$.
The statement can be proven by mathematical induction. As $\mu\left(x_{k}\right)$ is an admissible control law, we have $x_{k} \rightarrow 0$ as $k \rightarrow \infty$. Without loss of generality, let $x_{N}=0$, where $N \rightarrow \infty$. According to (28), we have
where $x_{N}=0$. According to (27), the performance index function $V_{\infty}\left(x_{k}\right)$ can be expressed as
$$\begin{aligned}
V_{\infty}\left(x_{k}\right) &= \lim _{N \rightarrow \infty}\left\{U\left(x_{k}, v_{\infty}\left(x_{k}\right)\right)+U\left(x_{k+1}, v_{\infty}\left(x_{k+1}\right)\right)+\cdots+U\left(x_{N-1}, v_{\infty}\left(x_{N-1}\right)\right)+V_{\infty}\left(x_{N}\right)\right\} \\
&= \lim _{N \rightarrow \infty}\left\{\min _{u_{k}}\left\{U\left(x_{k}, u_{k}\right)+\min _{u_{k+1}}\left\{U\left(x_{k+1}, u_{k+1}\right)+\cdots+\min _{u_{N-1}}\left\{U\left(x_{N-1}, u_{N-1}\right)+V_{\infty}\left(x_{N}\right)\right\}\right\}\right\}\right\}.
\end{aligned}$$
According to Corollary 3.1, for $\forall i=0,1, \ldots$, since $v_{i}\left(x_{k}\right)$ is an admissible control law, we have that $v_{\infty}\left(x_{k}\right)=\lim _{i \rightarrow \infty} v_{i}\left(x_{k}\right)$ is an admissible control law. Then, we can get $x_{N}=0$, where $N \rightarrow \infty$, which means $V_{\infty}\left(x_{N}\right)=P\left(x_{N}\right)=0$. For $N-1$, according to (27), we can obtain
Assume that the statement holds for $k=l+1$, $l=0,1, \ldots$. Then, for $k=l$, we have
holds. The mathematical induction is completed.
3) Show that the performance index function $V_{\infty}\left(x_{k}\right)$ is equal to the optimal performance index function $J^{*}\left(x_{k}\right)$.
According to the definition of $J^{*}\left(x_{k}\right)$ in (3), for $\forall i=0,1, \ldots$, we have
On the other hand, for an arbitrary admissible control law $\mu\left(x_{k}\right)$, (33) holds. Let $\mu\left(x_{k}\right)=u^{*}\left(x_{k}\right)$, where $u^{*}\left(x_{k}\right)$ is an optimal control law. Then, we can obtain
According to (35) and (36), we can obtain (19). The proof is completed.
Corollary 3.2: Let $x_{k} \in \mathbb{R}^{n}$ be an arbitrary state vector. If Theorem 3.2 holds, then we have that the iterative control law $v_{i}\left(x_{k}\right)$ converges to the optimal control law as $i \rightarrow \infty$, i.e., $u^{*}\left(x_{k}\right)=\lim _{i \rightarrow \infty} v_{i}\left(x_{k}\right)$.
Remark 3.1: From Theorems 3.1 and 3.2, we have proven that the iterative performance index functions are monotonically nonincreasing and converge to the optimal performance index function as $i \rightarrow \infty$. The stability of the nonlinear system can also be guaranteed under the iterative control laws. Hence, we can say that the convergence and stability properties of the policy iteration algorithms for continuous-time nonlinear systems are also valid for the policy iteration algorithms for discrete-time nonlinear systems. However, we emphasize that the analysis techniques for the continuous- and discrete-time policy iteration algorithms are quite different. First, the HJB equations of continuous- and discrete-time systems are inherently different. Second, the analysis method for continuous-time policy iteration algorithms is based on derivative techniques [17], [43]-[47], which are invalid in the discrete-time situation. In this paper, for the first time, we have established the properties of the discrete-time policy iteration algorithm, which provides the theoretical basis for applying policy iteration algorithms to the optimal control of discrete-time nonlinear systems. Hence, we say that the discussion in this paper is meaningful.
Remark 3.2: In [39], the value iteration algorithm of ADP is proposed to obtain the optimal solution of the HJB equation for discrete-time affine nonlinear systems of the form
$$x_{k+1}=f\left(x_{k}\right)+g\left(x_{k}\right) u_{k}$$
with the performance index function
$$J\left(x_{k}\right)=\sum_{j=k}^{\infty}\left(x_{j}^{T} Q x_{j}+u_{j}^{T} R u_{j}\right)$$
where $Q$ and $R$ are defined as the penalizing matrices for the system state and control vectors, respectively. Let $Q \in \mathbb{R}^{n \times n}$ and $R \in \mathbb{R}^{m \times m}$ both be positive definite matrices. Starting from $V_{0}\left(x_{k}\right) \equiv 0$, the value iteration algorithm can be expressed as
$$V_{i+1}\left(x_{k}\right)=\min _{u_{k}}\left\{x_{k}^{T} Q x_{k}+u_{k}^{T} R u_{k}+V_{i}\left(f\left(x_{k}\right)+g\left(x_{k}\right) u_{k}\right)\right\} \tag{39}$$
with the iterative control law $v_{i}\left(x_{k}\right)=\arg \min _{u_{k}}\left\{x_{k}^{T} Q x_{k}+u_{k}^{T} R u_{k}+V_{i}\left(f\left(x_{k}\right)+g\left(x_{k}\right) u_{k}\right)\right\}$.
It was proven in [39] that $V_{i}\left(x_{k}\right)$ is a monotonically nondecreasing and bounded sequence, and hence converges to $J^{*}\left(x_{k}\right)$. However, using the value iteration equation (39), the stability of the system under the iterative control law cannot be guaranteed. We should point out that using the policy iteration algorithm (7)-(10), the stability property can be guaranteed. In Section V, numerical results and comparisons between the policy and value iteration algorithms will be presented to show these properties.
Hence, we say that if an admissible control law of the nonlinear system is known, then the policy iteration algorithm is preferred for obtaining the optimal control law with good convergence and stability properties. From this point of view, the initial admissible control law is a key to the success of the policy iteration algorithm. In the following subsection, an effective method is discussed to obtain the initial admissible control law using neural networks.
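Before turning to the initial admissible control law, the following is a minimal tabular sketch of the value iteration recursion (39) for comparison with the policy iteration sketch given after (10). It reuses the same illustrative scalar system, utility, and grids (assumptions, not the paper's examples) with $Q=R=1$; note that it starts from $V_{0} \equiv 0$ and needs no initial admissible control law, but the intermediate control laws carry no stability guarantee.

```python
# Minimal tabular sketch of the value iteration recursion (39); the scalar
# system, utility, and grids are illustrative assumptions with Q = R = 1.
import numpy as np

xs = np.linspace(-2.0, 2.0, 201)                 # state grid
us = np.linspace(-2.0, 2.0, 201)                 # control grid
F = lambda x, u: 0.8 * np.sin(x) + u             # assumed system function
U = lambda x, u: x**2 + u**2                     # x^T Q x + u^T R u with Q = R = 1

V = np.zeros_like(xs)                            # V_0(x_k) = 0
for i in range(500):                             # value iteration index
    # V_{i+1}(x) = min_u { U(x, u) + V_i(F(x, u)) }: no separate policy evaluation.
    Q_grid = U(xs[:, None], us[None, :]) + \
        np.interp(F(xs[:, None], us[None, :]).ravel(), xs, V).reshape(len(xs), len(us))
    V_new = Q_grid.min(axis=1)
    if np.max(np.abs(V_new - V)) < 1e-8:
        V = V_new
        break
    V = V_new
v = us[np.argmin(Q_grid, axis=1)]                # iterative control law after the last sweep
print("value iteration J*(1.0):", np.interp(1.0, xs, V))
```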
C. Obtaining the Initial Admissible Control Law
As the policy iteration algorithm requires an admissible control law to start from, the method of obtaining such a control law is important for the implementation of the algorithm. Actually, for all the policy iteration algorithms of ADP, including [17] and [43]-[47], an initial admissible control law is necessary to implement the algorithm. Unfortunately, to the best of our knowledge, there is no theory on how to design the initial admissible control law. In this paper, we will give an effective method to obtain the initial admissible control law by repeated experiments using neural networks.
First, let $\Psi\left(x_{k}\right) \geq 0$ be an arbitrary semipositive definite function and let $\mu\left(x_{k}\right)$ be a given control law. For $i=0,1, \ldots$, we define a new performance index function as
$$\Phi_{i+1}\left(x_{k}\right)=U\left(x_{k}, \mu\left(x_{k}\right)\right)+\Phi_{i}\left(F\left(x_{k}, \mu\left(x_{k}\right)\right)\right) \tag{40}$$
where $\Phi_{0}\left(x_{k}\right)=\Psi\left(x_{k}\right)$. Then, we can derive the following theorem.
Theorem 3.3: Suppose Assumption 2.1 holds. Let $\Psi\left(x_{k}\right) \geq 0$ be an arbitrary semipositive definite function. Let $\mu\left(x_{k}\right)$ be an arbitrary control law for (1), which satisfies $\mu(0)=0$. Define the iterative performance index function $\Phi_{i}\left(x_{k}\right)$ as (40), where $\Phi_{0}\left(x_{k}\right)=\Psi\left(x_{k}\right)$. Then, $\mu\left(x_{k}\right)$ is an admissible control law if and only if the limit of $\Phi_{i}\left(x_{k}\right)$ exists as $i \rightarrow \infty$.
Proof: We first prove the sufficiency of the statement. Assume that $\mu\left(x_{k}\right)$ is an admissible control law. According to (40), we have
Algorithm 1 Obtain the Initial Admissible Control Law
Initialization:
Choose a semipositive definite function $\Psi\left(x_{k}\right) \geq 0$;
Initialize two neural networks (critic networks for brief), cnet1 and cnet2, with small random weights;
Let $\Phi_{0}\left(x_{k}\right)=\Psi\left(x_{k}\right)$;
Give the maximum iteration number of computation $i_{\max }$.
Iteration:
1: Establish a neural network (action network for brief) with small random weights to generate an initial control law $\mu\left(x_{k}\right)$ with $\mu\left(x_{k}\right)=0$ for $x_{k}=0$;
2: Let $i=0$. Train the critic network cnet1 to approximate $\Phi_{1}\left(x_{k}\right)$, where $\Phi_{1}\left(x_{k}\right)$ satisfies $\Phi_{1}\left(x_{k}\right)=U\left(x_{k}, \mu\left(x_{k}\right)\right)+\Phi_{0}\left(x_{k+1}\right)$ with $x_{k+1}=F\left(x_{k}, \mu\left(x_{k}\right)\right)$;
5: Use cnet1 to get $\Phi_{i+1}\left(x_{k}\right)$ and use cnet2 to get $\Phi_{i}\left(x_{k}\right)$. If $\left|\Phi_{i+1}\left(x_{k}\right)-\Phi_{i}\left(x_{k}\right)\right|<\varepsilon$, then goto Step 7. Else goto next step;
6: If $i>i_{\max }$, then goto Step 1. Else goto Step 3;
7: return $\mu\left(x_{k}\right)$ and let $v_{0}\left(x_{k}\right)=\mu\left(x_{k}\right)$.
Then, according to (41), we can obtain
If $\mu\left(x_{k}\right)$ is an admissible control law, then $\sum_{j=0}^{\infty} U\left(x_{k+j}, \mu\left(x_{k+j}\right)\right)$ is finite. As $\Psi\left(x_{k}\right)$ is finite for an arbitrary finite $x_{k}$, for $\forall i=0,1,2, \ldots$, $\Phi_{i+1}\left(x_{k}\right)$ is finite. Hence, $\lim _{i \rightarrow \infty} \Phi_{i}\left(x_{k}\right)$ is finite, which means $\Phi_{i+1}\left(x_{k}\right)=\Phi_{i}\left(x_{k}\right)$ as $i \rightarrow \infty$.
On the other hand, if the limit of $\Phi_{i}\left(x_{k}\right)$ exists as $i \rightarrow \infty$, then according to (42)-(44), as $\Psi\left(x_{k}\right)$ is finite, we can get that $\sum_{j=0}^{\infty} U\left(x_{k+j}, \mu\left(x_{k+j}\right)\right)$ is finite. Since the utility function $U\left(x_{k}, u_{k}\right)$ is positive definite for $\forall x_{k}, u_{k}$, we can obtain $U\left(x_{k+j}, \mu\left(x_{k+j}\right)\right) \rightarrow 0$ as $j \rightarrow \infty$. As $\mu\left(x_{k}\right)=0$ for $x_{k}=0$, we can get that $x_{k} \rightarrow 0$ as $k \rightarrow \infty$, which means that (1) is stable and $\mu\left(x_{k}\right)$ is an admissible control law. The necessity of the statement is proven, and the proof is completed.
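As a quick numerical illustration of the criterion in Theorem 3.3, the sketch below unrolls the recursion (40) along a simulated trajectory and checks whether $\Phi_{i}$ converges. The scalar system, utility, the choice $\Psi \equiv 0$, and the two candidate control laws are illustrative assumptions, not examples from the paper.

```python
# Numerical sketch of the admissibility criterion of Theorem 3.3: mu is judged
# admissible iff Phi_i(x_0) converges as i grows. All numerical choices assumed.
import numpy as np

F = lambda x, u: 0.8 * np.sin(x) + u          # assumed system function
U = lambda x, u: x**2 + u**2                  # assumed utility function
Psi = lambda x: 0.0                           # Psi(x_k) chosen as zero

def phi_sequence(mu, x0, i_max=300):
    """Phi_1(x0), Phi_2(x0), ... obtained by unrolling the recursion (40)."""
    phis, running, x = [], 0.0, x0
    for _ in range(i_max):
        u = mu(x)
        running += U(x, u)                    # accumulate U(x_{k+j}, mu(x_{k+j}))
        x = F(x, u)
        phis.append(running + Psi(x))         # Phi_{i+1}(x0)
    return np.array(phis)

def is_admissible(mu, x0=1.5, tol=1e-8):
    phis = phi_sequence(mu, x0)
    return abs(phis[-1] - phis[-2]) < tol     # limit exists -> admissible

print(is_admissible(lambda x: -0.8 * np.sin(x)))   # stabilizing law  -> True
print(is_admissible(lambda x: 0.5 * x))            # non-stabilizing law -> False
```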
Algorithm 2 The Discrete-Time Policy Iteration Algorithm
Initialization:
Choose randomly an array of initial states $x_{0}$;
Choose a computation precision $\varepsilon$;
Give the initial admissible control law $v_{0}\left(x_{k}\right)$;
Give the maximum iteration number of computation $i_{\max }$.
Iteration:
1: Let the iteration index $i=0$;
2: Construct the iterative performance index function $V_{0}\left(x_{k}\right)$ according to $v_{0}\left(x_{k}\right)$ by the GHJB equation (7);
3: Let $i=i+1$. Construct the iterative performance index function $V_{i}\left(x_{k}\right)$, which satisfies the GHJB equation (9);
5: If $V_{i-1}\left(x_{k}\right)-V_{i}\left(x_{k}\right)<\varepsilon$, goto Step 7. Else goto Step 6;
6: If $i<i_{\max }$, then goto Step 3. Else, goto Step 8;
7: return $v_{i}\left(x_{k}\right)$ and $V_{i}\left(x_{k}\right)$. The optimal control law is achieved;
8: return The optimal control law is not achieved within $i_{\max }$ iterations.
According to Theorem 3.3, we can establish an effective iteration algorithm by repeating experiments using neural networks. The detailed implementation of the iteration algorithm is expressed in Algorithm 1.
Remark 3.3: We can see that the previously outlined training procedure can be easily implemented by a computer program. After creating an action network, the weights of the critic networks cnet1 and cnet2 can be updated iteratively. If the weights of the critic network converge, then $\mu\left(x_{k}\right)$ must be an admissible control law. As the weights of the action network are chosen randomly, the admissibility of the control law is unknown before the weights of the critic networks converge. Hence, the iteration process of Algorithm 1 is implemented offline. For convenience of implementation, in this paper, we choose $\Psi\left(x_{k}\right) \equiv 0$ to obtain the initial admissible control law. The detailed training method of the neural networks is shown in the following section.
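The following is a simplified sketch of the repeated-experiment idea behind Algorithm 1 (without the neural-network training, and with $\Psi \equiv 0$): draw a candidate control law with small random weights, accept it if the sequence $\Phi_{i}$ of (40) converges, and otherwise draw a new candidate. The scalar system, utility, and the parameterization of the candidate laws are illustrative assumptions.

```python
# Sketch of Algorithm 1's repeat-until-admissible loop with Psi = 0; the system,
# utility, and the random candidate-law parameterization are assumed.
import numpy as np

rng = np.random.default_rng(2)
F = lambda x, u: 0.8 * np.sin(x) + u          # assumed system function
U = lambda x, u: x**2 + u**2                  # assumed utility function

def phi_converges(mu, x0=1.5, i_max=300, tol=1e-6):
    """Theorem 3.3 criterion: does Phi_i(x0) (with Psi = 0) converge?"""
    total, x, prev = 0.0, x0, np.inf
    for _ in range(i_max):
        total += U(x, mu(x))
        x = F(x, mu(x))
        if abs(total - prev) < tol:
            return True                       # limit exists -> mu is admissible
        prev = total
    return False

for attempt in range(100):                    # Step 1: generate a random candidate law
    w = rng.normal(scale=0.5, size=2)         # small random weights (assumed form)
    mu = lambda x, w=w: w[0] * x + w[1] * np.sin(x)   # satisfies mu(0) = 0
    if phi_converges(mu):                     # Steps 2-6: convergence test on Phi_i
        print("initial admissible law found after", attempt + 1, "experiment(s)")
        v0 = mu                               # Step 7: v_0(x_k) = mu(x_k)
        break
```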
D. Summary of the Policy Iteration Algorithm of ADP
According to the above preparations, we can summarize the discrete-time policy iteration algorithm in Algorithm 2.
Fig. 1. Structure diagram of the algorithm.
IV. Neural Network Implementation for the Optimal Control Scheme
In this paper, BP neural networks are used to approximate $v_{i}\left(x_{k}\right)$ and $V_{i}\left(x_{k}\right)$, respectively. The number of hidden layer neurons is denoted by $l$, the weight matrix between the input layer and the hidden layer is denoted by $Y$, and the weight matrix between the hidden layer and the output layer is denoted by $W$. Then, the output of the three-layer neural network is represented by
$$\hat{F}(X, Y, W)=W^{T} \sigma\left(Y^{T} X+b\right)$$
where $\sigma\left(Y^{T} X\right) \in \mathbb{R}^{l}$, $[\sigma(z)]_{i}=\left(e^{z_{i}}-e^{-z_{i}}\right) /\left(e^{z_{i}}+e^{-z_{i}}\right)$, $i=1, \ldots, l$, are the activation functions and $b$ is the threshold value.
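The following is a minimal sketch of this three-layer forward pass in Python; the layer sizes, weight scales, and input are illustrative assumptions. The activation is the hyperbolic tangent, matching the expression for $[\sigma(z)]_{i}$ above.

```python
# Forward pass of the three-layer network W^T * sigma(Y^T X + b); sizes assumed.
import numpy as np

def nn_forward(X, Y, W, b):
    """X: input vector, Y: input-to-hidden weights, W: hidden-to-output weights,
    b: hidden-layer threshold vector."""
    z = Y.T @ X + b
    sigma = np.tanh(z)                        # (e^z - e^-z) / (e^z + e^-z)
    return W.T @ sigma

rng = np.random.default_rng(0)
n, l, m = 2, 8, 1                             # state dim, hidden neurons, output dim
Y = rng.normal(scale=0.1, size=(n, l))        # small random initial weights
W = rng.normal(scale=0.1, size=(l, m))
b = np.zeros(l)
print(nn_forward(np.array([0.5, -0.3]), Y, W, b))   # e.g., an action-network output
```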
Here, there are two networks, namely the critic network and the action network. Both neural networks are chosen as three-layer feedforward neural networks. The whole structure diagram is shown in Fig. 1.
A. Critic Network
The critic network is used to approximate the performance index function $V_{i}\left(x_{k}\right)$. The output of the critic network is denoted as
where $Z_{c}(k)=Y_{c}^{T} x_{k}+b_{c}$. The target function is expressed in (9). Then, we define the error function for the critic network as
where $\alpha_{c}>0$ is the learning rate of the critic network. If the training precision is achieved, then we say that $V_{i+1}\left(x_{k}\right)$ can be approximated by the critic network.
There is an important property we should point out. Usually, the iterative performance index function $V_{i}\left(x_{k}\right)$ is difficult to obtain directly, as the critic network has not been trained yet. In this situation, there are two methods to obtain the target function $V_{i}\left(x_{k}\right)$. First, define a new iteration index $l=0,1, \ldots$ and let the iterative performance index function be expressed as $V_{i, l}\left(x_{k}\right)$. Then, let $V_{i, l}\left(x_{k}\right)$ be updated by
where $\hat{V}_{i, 0}\left(x_{k+1}\right)=V_{i-1}\left(x_{k+1}\right)$ is the output of the critic network from the previous iteration. The error function of the critic neural network can be expressed as
$$e_{c i, l}(k)=\hat{V}_{i, l+1}\left(x_{k}\right)-V_{i, l+1}\left(x_{k}\right).$$
Update the neural network according to (48) and (49) until $e_{c i, l}(k) \rightarrow 0$. Then let $l=l+1$ and repeat (50) and (51) until the following holds:
Then, we have $V_{i}\left(x_{k}\right)=V_{i, l}\left(x_{k}\right)$. This method is simple. However, (52) usually holds only for $l \rightarrow \infty$, i.e.,
Thus, the amount of computation is very large, and it takes a long time for the weights of the critic network to converge.
The other method can be expressed as follows. According to (9), we have
As $v_{i}\left(x_{k}\right)$ is an admissible control law, we have $V_{i}\left(x_{k+N}\right) \rightarrow 0$ as $N \rightarrow \infty$. Thus, choosing a large enough $N$, we have
$$V_{i}\left(x_{k}\right) \approx \sum_{j=0}^{N-1} U\left(x_{k+j}, v_{i}\left(x_{k+j}\right)\right)$$
where $x_{k+j+1}=F\left(x_{k+j}, v_{i}\left(x_{k+j}\right)\right)$.
Then, the error function (47) can be obtained. We can see that this method is much easier to apply. For convenience of computation, the second method is used in this paper.
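A minimal sketch of this second method in Python: the critic target for $V_{i}(x_{k})$ is formed by rolling the system out under the current iterative control law $v_{i}$ for $N$ steps and summing the utilities, truncating the tail since $V_{i}(x_{k+N}) \rightarrow 0$ for an admissible law. The system, utility, and the stand-in control law are illustrative assumptions.

```python
# Critic target by finite-horizon rollout under v_i (the "second method");
# the system, utility, and the stand-in control law are assumed for illustration.
import numpy as np

F = lambda x, u: 0.8 * np.sin(x) + u          # assumed system function
U = lambda x, u: x**2 + u**2                  # assumed utility function
v_i = lambda x: -0.8 * np.sin(x)              # stand-in for the current control law

def critic_target(x_k, N=200):
    """Approximate V_i(x_k) by a length-N rollout under v_i."""
    total, x = 0.0, x_k
    for _ in range(N):
        u = v_i(x)
        total += U(x, u)
        x = F(x, u)
    return total                               # training target for the critic output

print("target V_i(1.0):", critic_target(1.0))
```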
B. Action Network
In the action network, the state $x_{k}$ is used as the input to generate the iterative control law as the output of the network. The output can be formulated as
where $Z_{a}(k)=Y_{a}^{T} x_{k}+b_{a}$. The target of the output of the action network is given by (10). Thus, we can define the output error of the action network as
The weight-updating algorithm is similar to the one for the critic network. By the gradient descent rule, we can obtain
where $\beta_{a}>0$ is the learning rate of the action network. If the training precision is achieved, then we say that the iterative control law $v_{i}\left(x_{k}\right)$ can be approximated by the action network.
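The sketch below illustrates this training scheme under stated assumptions: the target control is the minimizer in (10), here computed on a control grid with a quadratic stand-in for $V_{i}$, and only the hidden-to-output weights $W_{a}$ are adjusted by gradient descent on the squared output error (the input-to-hidden weights are kept fixed, in the spirit of updating only one layer, as discussed next). All numerical choices are illustrative.

```python
# Action network training sketch: target from the argmin in (10), gradient
# descent on 0.5*e_a^2 over the output weights only; all numbers are assumed.
import numpy as np

rng = np.random.default_rng(1)
F = lambda x, u: 0.8 * np.sin(x) + u          # assumed system function
U = lambda x, u: x**2 + u**2                  # assumed utility function
V_i = lambda x: 2.0 * x**2                    # stand-in for the current critic output
us = np.linspace(-2.0, 2.0, 401)              # control grid for the argmin in (10)

def target_control(x):
    return us[np.argmin(U(x, us) + V_i(F(x, us)))]

l = 8                                         # hidden neurons
Y_a = rng.normal(scale=0.5, size=l)           # input-to-hidden weights (kept fixed)
b_a = rng.normal(scale=0.5, size=l)           # hidden thresholds
W_a = np.zeros(l)                             # hidden-to-output weights (trained)
beta_a = 0.05                                 # learning rate

for _ in range(5000):
    x = rng.uniform(-2.0, 2.0)                # sample a training state
    sigma = np.tanh(Y_a * x + b_a)            # hidden layer output
    e_a = W_a @ sigma - target_control(x)     # output error of the action network
    W_a -= beta_a * e_a * sigma               # gradient of 0.5*e_a^2 w.r.t. W_a

x_test = 1.0
print("network control :", W_a @ np.tanh(Y_a * x_test + b_a))
print("target control  :", target_control(x_test))
```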
In this paper, to enhance the convergence speed of the neural networks, only one layer of each neural network is updated during the training procedure. To guarantee the effectiveness of the neural network implementation, the convergence of the neural network weights is proven, which ensures that the iterative performance index function and the iterative control law can be approximated by the critic and action networks, respectively. The weight convergence property of the neural networks is shown in the following theorem.
Theorem 4.1: Let the target performance index function and the target iterative control law be expressed by
respectively. Let the critic and action networks be trained by (49) and (58), respectively. If the learning rates $\alpha_{c}$ and $\beta_{a}$ are both small enough, then the critic network weights $W_{c i}(k)$ and the action network weights $W_{a i}(k)$ converge asymptotically to the optimal weights $W_{c i}^{*}(k)$ and $W_{a i}^{*}(k)$, respectively.
Proof: Let $\bar{W}_{c i}^{j}=W_{c i}^{j}-W_{c i}^{*}$ and $\bar{W}_{a i}^{j}=W_{a i}^{j}-W_{a i}^{*}$. From (49) and (58), we have