
Policy Iteration Adaptive Dynamic Programming Algorithm for Discrete-Time Nonlinear Systems

Derong Liu, Fellow, IEEE, and Qinglai Wei, Member, IEEE

Abstract

This paper is concerned with a new discrete-time policy iteration adaptive dynamic programming (ADP) method for solving the infinite-horizon optimal control problem of nonlinear systems. The idea is to use an iterative ADP technique to obtain the iterative control law, which optimizes the iterative performance index function. The main contribution of this paper is to analyze the convergence and stability properties of the policy iteration method for discrete-time nonlinear systems for the first time. It is shown that the iterative performance index function is nonincreasingly convergent to the optimal solution of the Hamilton-Jacobi-Bellman equation. It is also proven that any of the iterative control laws can stabilize the nonlinear system. Neural networks are used to approximate the performance index function and compute the optimal control law, respectively, to facilitate the implementation of the iterative ADP algorithm, and the convergence of the weight matrices is analyzed. Finally, numerical results and analysis are presented to illustrate the performance of the developed method.

Index Terms-Adaptive critic designs, adaptive dynamic programming (ADP), approximate dynamic programming, discrete-time policy iteration, neural networks, neurodynamic programming, nonlinear systems, optimal control, reinforcement learning.

I. Introduction

DYNAMIC programming is a very useful tool in solving optimal control problems. However, due to the curse of dimensionality [1], it is often computationally untenable to run dynamic programming for obtaining the optimal solution. The adaptive/approximate dynamic programming (ADP) algorithms were proposed in [2] and [3] as a way to solve optimal control problems forward in time and gained much attention from the researchers [4]-[12]. There are several synonyms used for ADP including adaptive critic designs [13]-[15], ADP [16]-[19], approximate dynamic programming [20], [21], neural dynamic programming [22], neurodynamic programming [23], and reinforcement learning [24]-[26]. In [14] and [21], ADP approaches were classified into several main schemes, which include heuristic dynamic
programming (HDP), action-dependent HDP, also known as Q-learning [27], dual heuristic dynamic programming (DHP), action-dependent DHP, globalized DHP (GDHP), and action-dependent GDHP. Iterative methods are also used in ADP to obtain the solution of the Hamilton-Jacobi-Bellman (HJB) equation indirectly and have received more and more attention [28]-[35]. There are two main iterative ADP algorithms, which are the value iteration ADP (value iteration for brief) algorithm and the policy iteration ADP (policy iteration for brief) algorithm [36], [37].

The value iteration algorithm for optimal control of discrete-time nonlinear systems was given in [38]. In 2008, Al-Tamimi et al. [39] studied the value iteration algorithm for deterministic discrete-time affine nonlinear systems, which was referred to as HDP and was proposed for finding the optimal control law. Starting from a zero initial performance index function, it is proven in [39] that the iterative performance index function is a nondecreasing and bounded sequence. When the iteration index increases to infinity, the iterative performance index function converges to the optimal performance index function, which satisfies the HJB equation. In 2008, Zhang et al. [40] applied value iteration ADP to solve optimal tracking control problems for nonlinear systems. Liu et al. [16] realized value iteration ADP by GDHP. For the value iteration algorithm of ADP, the initial performance index function is required to be zero [16], [18], [39]-[42]. On the other hand, for the value iteration algorithm, the stability of the system under the iterative control law cannot be guaranteed. This means that only the converged optimal control law can be used to control the nonlinear system, and all the iterative controls during the iteration procedure may be invalid. Therefore, the computational efficiency of the value iteration algorithm of ADP is very low. Till now, all the value iteration algorithms have been implemented offline, which limits their practical applications.
The policy iteration algorithm for continuous-time systems was proposed in [17]. Murray et al. [17] proved that, for continuous-time affine nonlinear systems, the iterative performance index function converges nonincreasingly to the optimum and that each of the iterative control laws stabilizes the nonlinear system. This is a great merit of the policy iteration algorithm, and hence it has found many applications in solving optimal control problems of nonlinear systems. In 2005, Abu-Khalaf and Lewis [43] proposed a policy iteration algorithm for continuous-time nonlinear systems with control constraints. Zhang et al. [44] applied the policy iteration algorithm to solve continuous-time nonlinear two-person zero-sum games.
Vamvoudakis et al. [45] proposed multiagent differential graphical games for continuous-time linear systems using policy iteration algorithms. Bhasin et al. [46] proposed an online actor-critic-identifier architecture to obtain the optimal control law for uncertain nonlinear systems by policy iteration algorithms. Till now, nearly all the online iterative ADP algorithms have been policy iteration algorithms. However, almost all the discussions on policy iteration algorithms are based on continuous-time control systems [44]-[47]. Discussions on policy iteration algorithms for discrete-time control systems are scarce. Only in [36] and [37] was a brief sketch of the policy iteration algorithm for discrete-time systems presented, whereas the stability and convergence properties were not discussed. To the best of our knowledge, there are still no discussions focused on policy iteration algorithms for discrete-time systems, which motivates our research.
In this paper, the properties of the policy iteration algorithm of ADP for discrete-time nonlinear systems are analyzed for the first time. First, the policy iteration algorithm is introduced to find the optimal control law. Second, it will be shown that any of the iterative control laws can stabilize the nonlinear system. Third, it will be shown that the iterative control laws obtained by the policy iteration algorithm make the iterative performance index functions converge to the optimal solution monotonically. Next, an effective method is developed to obtain the initial admissible control law by repeating experiments. Furthermore, to facilitate the implementation of the policy iteration algorithm, it will be shown how to use neural networks to implement the developed policy iteration algorithm to obtain the iterative performance index functions and the iterative control laws. The weight convergence properties of the neural networks are also established. In the numerical examples, comparisons will be presented to show the effectiveness of the developed algorithm. For linear systems, the control results of the policy iteration algorithm will be compared with those of the algebraic Riccati equation (ARE) method to justify the correctness of the developed method. For nonlinear systems, the control results of the policy iteration algorithm will be compared with those of the value iteration algorithm to show the effectiveness of the developed algorithm.
The rest of this paper is organized as follows. In Section II, the problem formulation is presented. In Section III, the policy iteration algorithm for discrete-time nonlinear systems is derived. The stability, convergence, and optimality properties will also be proven in this section. In Section IV, the neural network implementation for the optimal control scheme is discussed. In Section V, numerical results and analyses are presented to demonstrate the effectiveness of the developed optimal control scheme. Finally, in Section VI, the conclusion is drawn and our future work is outlined.

II. Problem Formulation

In this paper, we will study the following deterministic discrete-time systems:
$$x_{k+1}=F(x_{k}, u_{k}), \quad k=0,1,2, \ldots \tag{1}$$
where $x_{k} \in \mathbb{R}^{n}$ is the $n$-dimensional state vector and $u_{k} \in \mathbb{R}^{m}$ is the $m$-dimensional control vector. Let $x_{0}$ be the initial state and $F(x_{k}, u_{k})$ be the system function.
Let $\underline{u}_{k}=\{u_{k}, u_{k+1}, \ldots\}$ be an arbitrary sequence of controls from $k$ to $\infty$. The performance index function for state $x_{0}$ under the control sequence $\underline{u}_{0}=\{u_{0}, u_{1}, \ldots\}$ is defined as
$$J(x_{0}, \underline{u}_{0})=\sum_{k=0}^{\infty} U(x_{k}, u_{k}) \tag{2}$$
where $U(x_{k}, u_{k})>0$ for $\forall x_{k}, u_{k} \neq 0$ is the utility function.
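To make the definitions above concrete, the following minimal Python sketch (not from the paper) evaluates the performance index (2) for a hypothetical one-dimensional system and a quadratic utility by truncating the infinite sum at a finite horizon; the functions F, U and the feedback law used here are illustrative assumptions only.

```python
import numpy as np

# Hypothetical one-dimensional instance of system (1) and utility in (2);
# F, U, and the feedback law below are assumptions chosen only for illustration.
def F(x, u):
    return 0.8 * np.sin(x) + u          # x_{k+1} = F(x_k, u_k)

def U(x, u):
    return x**2 + u**2                  # positive definite utility U(x_k, u_k)

def evaluate_J(x0, control_law, horizon=200):
    """Approximate J(x_0, u_0) in (2) by truncating the infinite sum."""
    x, J = x0, 0.0
    for _ in range(horizon):
        u = control_law(x)
        J += U(x, u)
        x = F(x, u)
    return J

# Example: an assumed stabilizing feedback u_k = -0.5 x_k gives a finite cost.
print(evaluate_J(1.0, lambda x: -0.5 * x))
```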

We will study optimal control problems for (1). The goal of this paper is to find an optimal control scheme, which stabilizes (1) and simultaneously minimizes the performance index function (2). For convenience of analysis, results of this paper are based on the following assumption.
Assumption 2.1: System (1) is controllable; the system state $x_{k}=0$ is an equilibrium state of (1) under the control $u_{k}=0$, i.e., $F(0,0)=0$; the feedback control $u_{k}=u(x_{k})$ satisfies $u_{k}=u(x_{k})=0$ for $x_{k}=0$; the utility function $U(x_{k}, u_{k})$ is a positive definite function for any $x_{k}$ and $u_{k}$.
As (1) is controllable, there exists a stable control sequence $\underline{u}_{k}=\{u_{k}, u_{k+1}, \ldots\}$ that moves $x_{k}$ to zero. Let $\underline{\mathfrak{A}}_{k}$ denote the set which contains all the stable control sequences. Then, the optimal performance index function can be defined as
$$J^{*}(x_{k})=\inf \{J(x_{k}, \underline{u}_{k}): \underline{u}_{k} \in \underline{\mathfrak{A}}_{k}\} \tag{3}$$
According to Bellman's principle of optimality, $J^{*}(x_{k})$ satisfies the following discrete-time HJB equation:
$$J^{*}(x_{k})=\inf_{u_{k}}\{U(x_{k}, u_{k})+J^{*}(F(x_{k}, u_{k}))\} \tag{4}$$
Define the optimal control law as
$$u^{*}(x_{k})=\arg \inf_{u_{k}}\{U(x_{k}, u_{k})+J^{*}(F(x_{k}, u_{k}))\} \tag{5}$$
Hence, the HJB equation (4) can be written as
$$J^{*}(x_{k})=U(x_{k}, u^{*}(x_{k}))+J^{*}(F(x_{k}, u^{*}(x_{k}))) \tag{6}$$
We can see that if we want to obtain the optimal control law $u^{*}(x_{k})$, we must obtain the optimal performance index function $J^{*}(x_{k})$. Generally, $J^{*}(x_{k})$ is unknown before all the controls $u_{k} \in \mathbb{R}^{m}$ are considered. If we adopt the traditional dynamic programming method to obtain the optimal performance index function at every time step, then we have to face the curse of dimensionality. Furthermore, the optimal control problem is discussed in the infinite horizon, which means that the length of the control sequence is infinite and makes the optimal control nearly impossible to obtain from the HJB equation (6). To overcome this difficulty, a new iterative algorithm based on ADP is developed.
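For a rough picture of what solving (4) and (5) entails, the sketch below (a toy illustration, not the method of this paper) approximates the right-hand side of (4) for a given value-function guess by searching over a finite grid of candidate controls; V_guess and the grid bounds are assumptions, and a practical scheme would replace them with a trained critic and a proper optimizer.

```python
import numpy as np

def bellman_rhs(x, V, F, U, u_grid):
    """Approximate inf_u { U(x,u) + V(F(x,u)) } in (4) by a grid search,
    returning the minimal value and the minimizing control as in (5)."""
    costs = [U(x, u) + V(F(x, u)) for u in u_grid]
    j = int(np.argmin(costs))
    return costs[j], u_grid[j]

# Usage with the assumed F and U from the previous sketch and a rough value guess.
F = lambda x, u: 0.8 * np.sin(x) + u
U = lambda x, u: x**2 + u**2
V_guess = lambda x: 2.0 * x**2          # assumed approximation of J*(x)
u_grid = np.linspace(-2.0, 2.0, 401)
print(bellman_rhs(1.0, V_guess, F, U, u_grid))
```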

III. Policy Iteration Algorithm

In this section, the discrete-time policy iteration algorithm of ADP is developed to obtain the optimal controller for nonlinear systems. The goal of the developed policy iteration algorithm is to construct an iterative control law $v_{i}(x_{k})$, which moves an arbitrary initial state $x_{0}$ to the equilibrium 0 and simultaneously makes the iterative performance index function reach the optimum. Stability proofs will be given to show that any of the iterative controls can stabilize the nonlinear system. Convergence and optimality proofs will also be given to show that the performance index functions converge to the optimum.

A. Derivation of the Policy Iteration Algorithm

For optimal control problems, the developed control scheme must not only stabilize the control system, but also make the performance index function finite, i.e., the control law must be admissible [39].
Definition 3.1: A control law $u(x_{k})$ is defined to be admissible with respect to (2) on $\Omega_{1}$ if $u(x_{k})$ is continuous on $\Omega_{1}$, $u(0)=0$, $u(x_{k})$ stabilizes (1) on $\Omega_{1}$, and $\forall x_{0} \in \Omega_{1}$, $J(x_{0})$ is finite.
In the developed policy iteration algorithm, the performance index function and control law are updated by iterations, with the iteration index $i$ increasing from zero to infinity. Let $v_{0}(x_{k})$ be an arbitrary admissible control law. For $i=0$, let $V_{0}(x_{k})$ be the iterative performance index function constructed by $v_{0}(x_{k})$ that satisfies the following generalized HJB (GHJB) equation:
$$V_{0}(x_{k})=U(x_{k}, v_{0}(x_{k}))+V_{0}(x_{k+1}) \tag{7}$$
where $x_{k+1}=F(x_{k}, v_{0}(x_{k}))$. Then, the iterative control law is computed by
$$\begin{aligned} v_{1}(x_{k}) & =\arg \min_{u_{k}}\{U(x_{k}, u_{k})+V_{0}(x_{k+1})\} \\ & =\arg \min_{u_{k}}\{U(x_{k}, u_{k})+V_{0}(F(x_{k}, u_{k}))\} \end{aligned} \tag{8}$$
For $\forall i=1,2, \ldots$, let $V_{i}(x_{k})$ be the iterative performance index function constructed by $v_{i}(x_{k})$, which satisfies the following GHJB equation:
$$V_{i}(x_{k})=U(x_{k}, v_{i}(x_{k}))+V_{i}(F(x_{k}, v_{i}(x_{k}))) \tag{9}$$
and the iterative control law is updated by
$$\begin{aligned} v_{i+1}(x_{k}) & =\arg \min_{u_{k}}\{U(x_{k}, u_{k})+V_{i}(x_{k+1})\} \\ & =\arg \min_{u_{k}}\{U(x_{k}, u_{k})+V_{i}(F(x_{k}, u_{k}))\} \end{aligned} \tag{10}$$
From the policy iteration algorithm (7)-(10), we can see that the iterative performance index function $V_{i}(x_{k})$ is used to approximate $J^{*}(x_{k})$ and the iterative control law $v_{i}(x_{k})$ is used to approximate $u^{*}(x_{k})$. As (9) is generally not the HJB equation, we have $V_{i}(x_{k}) \neq J^{*}(x_{k})$ and $v_{i}(x_{k}) \neq u^{*}(x_{k})$ for $\forall i=0,1, \ldots$. Therefore, it is necessary to determine whether the algorithm is convergent, which means that $V_{i}(x_{k})$ and $v_{i}(x_{k})$ converge to the optimal ones as $i \rightarrow \infty$. In the following section, we will show the properties of the policy iteration algorithm.
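To illustrate the structure of (7)-(10), the following Python sketch applies the policy iteration updates to the scalar toy example used in Section II, with tabulated (grid-based) representations of $V_{i}$ and $v_{i}$ in place of the neural networks discussed later in the paper. Policy evaluation approximates the GHJB equation (9) by a truncated rollout of the accumulated utility under $v_{i}$, and policy improvement performs the minimization (10) over a control grid. The system, utility, grids, horizon, and the initial law $v_{0}(x)=-0.5x$ (assumed admissible) are all illustrative assumptions, not values from the paper.

```python
import numpy as np

# Assumed scalar example (not from the paper): system F, utility U, and grids.
F = lambda x, u: 0.8 * np.sin(x) + u
U = lambda x, u: x**2 + u**2
x_grid = np.linspace(-2.0, 2.0, 81)
u_grid = np.linspace(-2.0, 2.0, 201)

def evaluate_policy(v_tab, horizon=200):
    """Policy evaluation (7)/(9): V_i(x) is the utility accumulated along the
    trajectory generated by v_i, approximated by a truncated rollout."""
    V_tab = np.zeros_like(x_grid)
    for j, x0 in enumerate(x_grid):
        x, total = x0, 0.0
        for _ in range(horizon):
            u = np.interp(x, x_grid, v_tab)   # v_i(x) by linear interpolation
            total += U(x, u)
            x = F(x, u)
        V_tab[j] = total
    return V_tab

def improve_policy(V_tab):
    """Policy improvement (8)/(10): v_{i+1}(x) = argmin_u { U(x,u) + V_i(F(x,u)) }."""
    v_tab = np.zeros_like(x_grid)
    for j, x in enumerate(x_grid):
        costs = [U(x, u) + np.interp(F(x, u), x_grid, V_tab) for u in u_grid]
        v_tab[j] = u_grid[int(np.argmin(costs))]
    return v_tab

v_tab = -0.5 * x_grid                     # v_0: assumed admissible initial law
for i in range(5):
    V_tab = evaluate_policy(v_tab)
    # V_i at a fixed state; it should be nonincreasing in i (see Section III-B),
    # up to discretization and truncation error.
    print(f"i = {i}:  V_i(1.0) = {np.interp(1.0, x_grid, V_tab):.4f}")
    v_tab = improve_policy(V_tab)
```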

B. Properties of the Policy Iteration Algorithm

For the policy iteration algorithm of continuous-time nonlinear systems [17], it was shown that any of the iterative control laws can stabilize the system. This is a merit of the policy iteration algorithm. For discrete-time systems, the stability of the system can also be guaranteed under the iterative control law.
Lemma 3.1: For $i=0,1, \ldots$, let $V_{i}(x_{k})$ and $v_{i}(x_{k})$ be obtained by the policy iteration algorithm (7)-(10), where $v_{0}(x_{k})$ is an arbitrary admissible control law. If Assumption 2.1 holds, then for $\forall i=0,1, \ldots$, the iterative control law $v_{i}(x_{k})$ stabilizes the nonlinear system (1).
For continuous-time policy iteration algorithms [17], based on the derivative of the difference between the iterative performance index functions with respect to the time variable $t$, it is proven that the iterative performance index functions are nonincreasingly convergent to the optimal solution of the HJB equation. For discrete-time systems, the method for the continuous-time case is invalid. Thus, a new convergence proof is established in this paper. We will show that, using the policy iteration algorithm for discrete-time nonlinear systems, the iterative performance index functions are also nonincreasingly convergent to the optimum.
Theorem 3.1: For $i=0,1, \ldots$, let $V_{i}(x_{k})$ and $v_{i}(x_{k})$ be obtained by (7)-(10). If Assumption 2.1 holds, then for $\forall x_{k} \in \mathbb{R}^{n}$, the iterative performance index function $V_{i}(x_{k})$ is a monotonically nonincreasing sequence for $\forall i \geq 0$
$$V_{i+1}(x_{k}) \leq V_{i}(x_{k}) \tag{11}$$
Proof: First, for $i=0,1, \ldots$, define a new iterative performance index function $\Gamma_{i+1}(x_{k})$ as
$$\Gamma_{i+1}(x_{k})=U(x_{k}, v_{i+1}(x_{k}))+V_{i}(x_{k+1}) \tag{12}$$
where $v_{i+1}(x_{k})$ is obtained by (10). According to (9), (10), and (12), for $\forall x_{k}$, we can obtain
$$\Gamma_{i+1}(x_{k}) \leq V_{i}(x_{k}) \tag{13}$$
Inequality (11) will be proven by mathematical induction. According to Lemma 3.1, for $i=0,1, \ldots$, we have that $v_{i+1}(x_{k})$ is a stable control law. Then, we have $x_{k} \rightarrow 0$ as $k \rightarrow \infty$. Without loss of generality, let $x_{N}=0$, where $N \rightarrow \infty$. We can obtain
$$V_{i+1}(x_{N})=\Gamma_{i+1}(x_{N})=V_{i}(x_{N})=0 \tag{14}$$
First, we let $k=N-1$. According to (10), we have
$$\begin{aligned} v_{i+1}(x_{N-1}) & =\arg \min_{u_{N-1}}\{U(x_{N-1}, u_{N-1})+V_{i}(x_{N})\} \\ & =\arg \min_{u_{N-1}}\{U(x_{N-1}, u_{N-1})+V_{i+1}(x_{N})\} \end{aligned} \tag{15}$$
According to (9), we can obtain
$$\begin{aligned} V_{i+1}(x_{N-1}) & =U(x_{N-1}, v_{i+1}(x_{N-1}))+V_{i+1}(x_{N}) \\ & =\min_{u_{N-1}}\{U(x_{N-1}, u_{N-1})+V_{i}(x_{N})\} \\ & \leq U(x_{N-1}, v_{i}(x_{N-1}))+V_{i}(x_{N}) \\ & =V_{i}(x_{N-1}) \end{aligned} \tag{16}$$
Therefore, the conclusion holds for $k=N-1$. Assume that the conclusion holds for $k=l+1$, $l=0,1, \ldots$. Then, for $k=l$, we can obtain
$$\begin{aligned} V_{i+1}(x_{l}) & =U(x_{l}, v_{i+1}(x_{l}))+V_{i+1}(x_{l+1}) \\ & \leq U(x_{l}, v_{i+1}(x_{l}))+V_{i}(x_{l+1}) \\ & =\Gamma_{i+1}(x_{l}) \end{aligned} \tag{17}$$
According to (13), for $\forall x_{l}$, we have
$$\Gamma_{i+1}(x_{l}) \leq V_{i}(x_{l}) \tag{18}$$
According to (17) and (18), we can obtain that for $i=0,1, \ldots$, (11) holds for $\forall x_{k}$. The proof is completed.
Corollary 3.1: For $i=0,1, \ldots$, let $V_{i}(x_{k})$ and $v_{i}(x_{k})$ be obtained by (7)-(10), where $v_{0}(x_{k})$ is an arbitrary admissible control law. If Assumption 2.1 holds, then for $\forall i=0,1, \ldots$, the iterative control law $v_{i}(x_{k})$ is admissible.
From Theorem 3.1, we know that the iterative performance index function $V_{i}(x_{k}) \geq 0$ is monotonically nonincreasing and bounded below for iteration index $i=1,2, \ldots$. Hence, we can derive the following theorem.
Theorem 3.2: For $i=0,1, \ldots$, let $V_{i}(x_{k})$ and $v_{i}(x_{k})$ be obtained by (7)-(10). If Assumption 2.1 holds, then the iterative performance index function $V_{i}(x_{k})$ converges to the optimal performance index function $J^{*}(x_{k})$ as $i \rightarrow \infty$
$$\lim_{i \rightarrow \infty} V_{i}(x_{k})=J^{*}(x_{k}) \tag{19}$$
which satisfies the HJB equation (4).

Proof: The statement can be proven by the following three steps.
1) Show that the limit of the iterative performance index function $V_{i}(x_{k})$ satisfies the HJB equation as $i \rightarrow \infty$.
According to Theorem 3.1, since $V_{i}(x_{k})$ is a nonincreasing and lower bounded sequence, the limit of the iterative performance index function $V_{i}(x_{k})$ exists as $i \rightarrow \infty$. Define the performance index function $V_{\infty}(x_{k})$ as the limit of the iterative function $V_{i}(x_{k})$
$$V_{\infty}(x_{k})=\lim_{i \rightarrow \infty} V_{i}(x_{k}) \tag{20}$$
According to the definition of $\Gamma_{i+1}(x_{k})$ in (12), we have
$$\begin{aligned} \Gamma_{i+1}(x_{k}) & =U(x_{k}, v_{i+1}(x_{k}))+V_{i}(x_{k+1}) \\ & =\min_{u_{k}}\{U(x_{k}, u_{k})+V_{i}(x_{k+1})\} \end{aligned} \tag{21}$$
According to (17), we can obtain
$$V_{i+1}(x_{k}) \leq \Gamma_{i+1}(x_{k}) \tag{22}$$
for $\forall x_{k}$. Let $i \rightarrow \infty$. We can obtain
$$\begin{aligned} V_{\infty}(x_{k})=\lim_{i \rightarrow \infty} V_{i}(x_{k}) & \leq V_{i+1}(x_{k}) \leq \Gamma_{i+1}(x_{k}) \\ & \leq \min_{u_{k}}\{U(x_{k}, u_{k})+V_{i}(x_{k+1})\} \end{aligned} \tag{23}$$
Thus, we can obtain
$$V_{\infty}(x_{k}) \leq \min_{u_{k}}\{U(x_{k}, u_{k})+V_{\infty}(x_{k+1})\} \tag{24}$$
Let $\varepsilon>0$ be an arbitrary positive number. Since $V_{i}(x_{k})$ is nonincreasing for $i \geq 1$ and $\lim_{i \rightarrow \infty} V_{i}(x_{k})=V_{\infty}(x_{k})$, there exists a positive integer $p$ such that
ε > 0 ε > 0 epsi > 0\varepsilon>0 为任意正数。由于 V i ( x k ) V i x k V_(i)(x_(k))V_{i}\left(x_{k}\right) i 1 i 1 i >= 1i \geq 1 lim i V i ( x k ) = V ( x k ) lim i V i x k = V x k lim_(i rarr oo)V_(i)(x_(k))=V_(oo)(x_(k))\lim _{i \rightarrow \infty} V_{i}\left(x_{k}\right)=V_{\infty}\left(x_{k}\right) 上是非递增的,因此存在一个正整数 p p pp ,使得
$$V_{p}(x_{k})-\varepsilon \leq V_{\infty}(x_{k}) \leq V_{p}(x_{k}) \tag{25}$$
Hence, we can obtain
$$\begin{aligned} V_{\infty}(x_{k}) & \geq U(x_{k}, v_{p}(x_{k}))+V_{p}(F(x_{k}, v_{p}(x_{k})))-\varepsilon \\ & \geq U(x_{k}, v_{p}(x_{k}))+V_{\infty}(F(x_{k}, v_{p}(x_{k})))-\varepsilon \\ & \geq \min_{u_{k}}\{U(x_{k}, u_{k})+V_{\infty}(x_{k+1})\}-\varepsilon \end{aligned}$$
Since $\varepsilon$ is arbitrary, we have
$$V_{\infty}(x_{k}) \geq \min_{u_{k}}\{U(x_{k}, u_{k})+V_{\infty}(x_{k+1})\} \tag{26}$$
Combining (24) and (26), we can obtain
$$V_{\infty}(x_{k})=\min_{u_{k}}\{U(x_{k}, u_{k})+V_{\infty}(x_{k+1})\} \tag{27}$$
Next, let $\mu(x_{k})$ be an arbitrary admissible control law, and define a new performance index function $P(x_{k})$, which satisfies
$$P(x_{k})=U(x_{k}, \mu(x_{k}))+P(x_{k+1}) \tag{28}$$
Then, we can proceed to the second step of the proof.

2) Show that for an arbitrary admissible control law $\mu(x_{k})$, the performance index function $V_{\infty}(x_{k}) \leq P(x_{k})$.
The statement can be proven by mathematical induction. As $\mu(x_{k})$ is an admissible control law, we have $x_{k} \rightarrow 0$ as $k \rightarrow \infty$. Without loss of generality, let $x_{N}=0$, where $N \rightarrow \infty$. According to (28), we have
$$P(x_{k})=\lim_{N \rightarrow \infty}\{U(x_{k}, \mu(x_{k}))+U(x_{k+1}, \mu(x_{k+1}))+\cdots+U(x_{N-1}, \mu(x_{N-1}))+P(x_{N})\} \tag{29}$$
where $x_{N}=0$. According to (27), the iterative performance index function $V_{\infty}(x_{k})$ can be expressed as
x N = 0 x N = 0 x_(N)=0x_{N}=0 。根据(27),迭代性能指标函数 V ( x k ) V x k V_(oo)(x_(k))V_{\infty}\left(x_{k}\right) 可以表示为
$$\begin{aligned} V_{\infty}(x_{k}) & =\lim_{N \rightarrow \infty}\{U(x_{k}, v_{\infty}(x_{k}))+U(x_{k+1}, v_{\infty}(x_{k+1}))+\cdots+U(x_{N-1}, v_{\infty}(x_{N-1}))+V_{\infty}(x_{N})\} \\ & =\lim_{N \rightarrow \infty} \min_{u_{k}}\{U(x_{k}, u_{k})+\min_{u_{k+1}}\{U(x_{k+1}, u_{k+1})+\cdots+\min_{u_{N-1}}\{U(x_{N-1}, u_{N-1})+V_{\infty}(x_{N})\}\}\} \end{aligned} \tag{30}$$
According to Corollary 3.1, for $\forall i=0,1, \ldots$, since $v_{i}(x_{k})$ is an admissible control law, we have that $v_{\infty}(x_{k})=\lim_{i \rightarrow \infty} v_{i}(x_{k})$ is an admissible control law. Then, we can get $x_{N}=0$, where $N \rightarrow \infty$, which means $V_{\infty}(x_{N})=P(x_{N})=0$. For $N-1$, according to (27), we can obtain
$$\begin{aligned} P(x_{N-1}) & =U(x_{N-1}, \mu(x_{N-1}))+P(x_{N}) \\ & \geq \min_{u_{N-1}}\{U(x_{N-1}, u_{N-1})+P(x_{N})\} \\ & =\min_{u_{N-1}}\{U(x_{N-1}, u_{N-1})+V_{\infty}(x_{N})\} \\ & =V_{\infty}(x_{N-1}) \end{aligned} \tag{31}$$
Assume that the statement holds for $k=l+1$, $l=0,1, \ldots$. Then, for $k=l$, we have
$$\begin{aligned} P(x_{l}) & =U(x_{l}, \mu(x_{l}))+P(x_{l+1}) \\ & \geq \min_{u_{l}}\{U(x_{l}, u_{l})+P(x_{l+1})\} \\ & \geq \min_{u_{l}}\{U(x_{l}, u_{l})+V_{\infty}(x_{l+1})\} \\ & =V_{\infty}(x_{l}) \end{aligned} \tag{32}$$
Hence, for $\forall x_{k}$, $k=0,1, \ldots$, the inequality
$$V_{\infty}(x_{k}) \leq P(x_{k}) \tag{33}$$
holds. Mathematical induction is completed.

3) Show that the performance index function $V_{\infty}(x_{k})$ is equal to the optimal performance index function $J^{*}(x_{k})$.
According to the definition of $J^{*}(x_{k})$ in (3), for $\forall i=0,1, \ldots$, we have
$$V_{i}(x_{k}) \geq J^{*}(x_{k}) \tag{34}$$
Let $i \rightarrow \infty$, and then we can obtain
i i i rarr ooi \rightarrow \infty ,然后我们可以得到
$$V_{\infty}(x_{k}) \geq J^{*}(x_{k}) \tag{35}$$
On the other hand, for an arbitrary admissible control law $\mu(x_{k})$, (33) holds. Let $\mu(x_{k})=u^{*}(x_{k})$, where $u^{*}(x_{k})$ is an optimal control law. Then, we can obtain
$$V_{\infty}(x_{k}) \leq J^{*}(x_{k}) \tag{36}$$
According to (35) and (36), we can obtain (19). The proof is completed.
Corollary 3.2: Let $x_{k} \in \mathbb{R}^{n}$ be an arbitrary state vector. If Theorem 3.2 holds, then the iterative control law $v_{i}(x_{k})$ converges to the optimal control law as $i \rightarrow \infty$, i.e., $u^{*}(x_{k})=\lim_{i \rightarrow \infty} v_{i}(x_{k})$.
Remark 3.1: From Theorems 3.1 and 3.2, we have proven that the iterative performance index functions are monotonically nonincreasing and convergent to the optimal performance index function as $i \rightarrow \infty$. The stability of the nonlinear system can also be guaranteed under the iterative control laws. Hence, we can say that the convergence and stability properties of the policy iteration algorithms for continuous-time nonlinear systems are also valid for the policy iteration algorithms for discrete-time nonlinear systems. However, we emphasize that the analysis techniques for the continuous- and discrete-time policy iteration algorithms are quite different. First, the HJB equations of continuous- and discrete-time systems are inherently different. Second, the analysis method for the continuous-time policy iteration algorithm is based on derivative techniques [17], [43]-[47], which are invalid in the discrete-time situation. In this paper, for the first time, we have established the properties of the discrete-time policy iteration algorithm, which provides the theoretical basis for applying policy iteration algorithms to the optimal control problem of discrete-time nonlinear systems. Hence, we say that the discussion in this paper is meaningful.
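As a concrete check of these properties, consider a hypothetical scalar linear-quadratic case (our own illustration, not an example from the paper): $x_{k+1}=a x_{k}+b u_{k}$ with $U(x_{k},u_{k})=q x_{k}^{2}+r u_{k}^{2}$ and linear feedback $u_{k}=-K_{i} x_{k}$. Here the GHJB equation (9) gives $V_{i}(x)=P_{i} x^{2}$ with $P_{i}=(q+r K_{i}^{2})/(1-(a-b K_{i})^{2})$, and the improvement step (10) gives $K_{i+1}=a b P_{i}/(r+b^{2} P_{i})$, so the iteration can be run in closed form and compared against the scalar discrete-time ARE. All numbers below are assumptions chosen only for illustration.

```python
# Hypothetical scalar linear-quadratic illustration (assumed numbers, not from the paper):
# x_{k+1} = a x_k + b u_k,  U(x,u) = q x^2 + r u^2,  u_k = -K_i x_k.
a, b, q, r = 1.2, 1.0, 1.0, 1.0
K = 1.0                                   # assumed admissible initial gain: |a - b K| = 0.2 < 1

for i in range(8):
    ac = a - b * K                        # closed-loop pole under v_i
    P = (q + r * K**2) / (1.0 - ac**2)    # policy evaluation (9): V_i(x) = P_i x^2
    print(f"i = {i}:  P_i = {P:.6f},  |a - b K_i| = {abs(ac):.3f}")
    K = a * b * P / (r + b**2 * P)        # policy improvement (10)

# Fixed-point iteration of the scalar discrete-time ARE for comparison.
P_are = 0.0
for _ in range(200):
    P_are = q + a**2 * P_are - (a * b * P_are)**2 / (r + b**2 * P_are)
print("ARE solution:", P_are)
```

In this run the sequence $P_{i}$ is nonincreasing and every closed-loop pole satisfies $|a-bK_{i}|<1$, in line with Theorem 3.1 and Lemma 3.1, and the limit of $P_{i}$ matches the ARE solution, in line with Theorem 3.2.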
Remark 3.2: In [39], the value iteration algorithm of ADP was proposed to obtain the optimal solution of the HJB equation for the following discrete-time affine nonlinear system:
$$x_{k+1}=f(x_{k})+g(x_{k}) u_{k} \tag{37}$$
with the performance index function
$$J(x_{k})=\sum_{j=k}^{\infty}(x_{j}^{T} Q x_{j}+u_{j}^{T} R u_{j}) \tag{38}$$
where $Q$ and $R$ are defined as the penalizing matrices for the system state and control vectors, respectively. Let $Q \in \mathbb{R}^{n \times n}$ and $R \in \mathbb{R}^{m \times m}$ be both positive definite matrices. Starting from $V_{0}(x_{k}) \equiv 0$, the value iteration algorithm can be expressed as
$$\left\{\begin{array}{l} u_{i}(x_{k})=-\dfrac{1}{2} R^{-1} g(x_{k})^{T} \dfrac{\partial V_{i}(x_{k+1})}{\partial x_{k+1}} \\ V_{i+1}(x_{k})=x_{k}^{T} Q x_{k}+u_{i}^{T}(x_{k}) R u_{i}(x_{k})+V_{i}(x_{k+1}) \end{array}\right. \tag{39}$$
It was proven in [39] that $V_{i}(x_{k})$ is a monotonically nondecreasing and bounded sequence, and hence converges to $J^{*}(x_{k})$. However, using the value iteration equation (39), the stability of the system under the iterative control law cannot be guaranteed. We should point out that using the policy iteration algorithm (7)-(10), the stability property can be guaranteed. In Section V, numerical results and comparisons between the policy and value iteration algorithms will be presented to show these properties.
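For comparison, the following sketch runs value iteration on the same assumed scalar example used earlier, starting from $V_{0}(x_{k}) \equiv 0$. For simplicity it uses the generic update $V_{i+1}(x)=\min_{u}\{U(x,u)+V_{i}(F(x,u))\}$ on grids rather than the gradient-based affine form (39); the grids and iteration count are illustrative assumptions.

```python
import numpy as np

# Value iteration contrast on the assumed scalar example (generic update on grids,
# not the gradient-based affine form (39)).
F = lambda x, u: 0.8 * np.sin(x) + u
U = lambda x, u: x**2 + u**2
x_grid = np.linspace(-2.0, 2.0, 81)
u_grid = np.linspace(-2.0, 2.0, 201)

V_tab = np.zeros_like(x_grid)             # V_0(x_k) = 0
for i in range(31):
    V_new = np.empty_like(V_tab)
    for j, x in enumerate(x_grid):
        costs = [U(x, u) + np.interp(F(x, u), x_grid, V_tab) for u in u_grid]
        V_new[j] = min(costs)             # V_{i+1}(x) = min_u { U(x,u) + V_i(F(x,u)) }
    V_tab = V_new
    if i % 10 == 0:
        print(f"i = {i:2d}:  V_i(1.0) = {np.interp(1.0, x_grid, V_tab):.4f}")
# Here V_i(1.0) increases toward its limit, the opposite direction of the policy
# iteration sketch, and the intermediate greedy controls need not stabilize (1).
```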
Hence, if an admissible control law of the nonlinear system is known, the policy iteration algorithm is preferred for obtaining the optimal control law with good convergence and stability properties. From this point of view, the initial admissible control law is key to the success of the policy iteration algorithm. In the following section, an effective method for obtaining the initial admissible control law using neural networks will be discussed.
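For concreteness, the following is a toy numerical sketch of the value-iteration recursion above starting from $V_0(x_k) \equiv 0$. It is not the closed-form update (39): it assumes a hypothetical scalar affine system and replaces the gradient expression by a brute-force search over a discretized control set, but it exhibits the same nondecreasing convergence of $V_i$ from zero.

```python
import numpy as np

# Hypothetical scalar affine system x_{k+1} = f(x) + g(x) u (for illustration only).
f = lambda x: 0.8 * x
g = lambda x: 0.5
U = lambda x, u: x**2 + 0.5 * u**2           # quadratic utility

xs = np.linspace(-2.0, 2.0, 201)             # state grid
us = np.linspace(-2.0, 2.0, 81)              # discretized control set
V = np.zeros_like(xs)                        # value iteration starts from V_0 = 0

for i in range(50):
    # V_{i+1}(x) = min_u { U(x,u) + V_i(f(x)+g(x)u) }, with V_i evaluated by interpolation
    x_next = f(xs[:, None]) + g(xs[:, None]) * us[None, :]
    x_next = np.clip(x_next, xs[0], xs[-1])
    Q_val = U(xs[:, None], us[None, :]) + np.interp(x_next, xs, V)
    V_new = Q_val.min(axis=1)
    if np.max(np.abs(V_new - V)) < 1e-6:
        break
    V = V_new

print("iterations:", i, " V(1.0) approx", np.interp(1.0, xs, V))
```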

C. Obtaining the Initial Admissible Control Law

As the policy iteration algorithm requires starting from an admissible control law, the method of obtaining such a control law is important to the implementation of the algorithm. Actually, all of the policy iteration algorithms of ADP, including [17] and [43]-[47], require an initial admissible control law for their implementation. Unfortunately, to the best of our knowledge, there is no theory on how to design the initial admissible control law. In this paper, we give an effective method to obtain the initial admissible control law through experiments using neural networks.
First, let $\Psi(x_k) \geq 0$ be an arbitrary semipositive definite function. For $i = 0, 1, \ldots$, we define a new performance index function as
$$\Phi_{i+1}(x_k) = U(x_k, \mu(x_k)) + \Phi_i(x_{k+1})$$
where $\Phi_0(x_k) = \Psi(x_k)$. Then, we can derive the following theorem.
Theorem 3.3: Suppose Assumption 2.1 holds. Let $\Psi(x_k) \geq 0$ be an arbitrary semipositive definite function. Let $\mu(x_k)$ be an arbitrary control law for (1), which satisfies $\mu(0) = 0$. Define the iterative performance index function $\Phi_i(x_k)$ as in (40), where $\Phi_0(x_k) = \Psi(x_k)$. Then, $\mu(x_k)$ is an admissible control law if and only if the limit of $\Phi_i(x_k)$ exists as $i \rightarrow \infty$.
Proof: We first prove the sufficiency of the statement. Assume that $\mu(x_k)$ is an admissible control law. According to (40), we have
$$\begin{aligned} \Phi_{i+1}(x_k) - \Phi_i(x_k) &= U(x_k, \mu(x_k)) + \Phi_i(x_{k+1}) - \left(U(x_k, \mu(x_k)) + \Phi_{i-1}(x_{k+1})\right) \\ &= \Phi_i(x_{k+1}) - \Phi_{i-1}(x_{k+1}) \end{aligned}$$
Algorithm 1 Obtain the Initial Admissible Control Law

Initialization:
Choose a semipositive definite function $\Psi(x_k) \geq 0$;

Initialize two neural networks (critic networks for brief)

cnet 1 and cnet 2 with small random weights;

Let $\Phi_0(x_k) = \Psi(x_k)$;
Give the maximum number of iterations $i_{\max}$.

Iteration:
1: Establish a neural network (action network for brief) with small random weights to generate an initial control law $\mu(x_k)$ with $\mu(x_k) = 0$ for $x_k = 0$;

2: Let $i = 0$. Train the critic network cnet 1 to approximate $\Phi_1(x_k)$, where $\Phi_1(x_k)$ satisfies
$$\Phi_1(x_k) = U(x_k, \mu(x_k)) + \Phi_0(x_{k+1})$$
3: Copy cnet 1 to cnet 2, i.e., cnet 2 = cnet 1;

4: Let $i = i + 1$. Use cnet 2 to get $\Phi_i(x_{k+1})$ and train the critic network cnet 1 to approximate $\Phi_{i+1}(x_k)$, where $\Phi_{i+1}(x_k)$ satisfies
$$\Phi_{i+1}(x_k) = U(x_k, \mu(x_k)) + \Phi_i(x_{k+1})$$
5: Use cnet 1 to get $\Phi_{i+1}(x_k)$ and use cnet 2 to get $\Phi_i(x_k)$. If $\left|\Phi_{i+1}(x_k) - \Phi_i(x_k)\right| < \varepsilon$, then goto Step 7. Else goto the next step;

6: If $i > i_{\max}$, then goto Step 1. Else goto Step 3;
7: return $\mu(x_k)$ and let $v_0(x_k) = \mu(x_k)$.
Then, according to (41), we can obtain
$$\left\{\begin{aligned} \Phi_{i+1}(x_k) - \Phi_i(x_k) &= \Phi_1(x_{k+i}) - \Phi_0(x_{k+i}) \\ \Phi_i(x_k) - \Phi_{i-1}(x_k) &= \Phi_1(x_{k+i-1}) - \Phi_0(x_{k+i-1}) \\ &\;\;\vdots \\ \Phi_1(x_k) - \Phi_0(x_k) &= \Phi_1(x_k) - \Phi_0(x_k) \end{aligned}\right.$$
Then, we can get
$$\Phi_{i+1}(x_k) = \sum_{j=0}^{i} U(x_{k+j}, \mu(x_{k+j})) + \Psi(x_k)$$
Let $i \rightarrow \infty$. We can obtain
$$\lim_{i \rightarrow \infty} \Phi_i(x_k) = \sum_{j=0}^{\infty} U(x_{k+j}, \mu(x_{k+j})) + \Psi(x_k)$$
If $\mu(x_k)$ is an admissible control law, then $\sum_{j=0}^{\infty} U(x_{k+j}, \mu(x_{k+j}))$ is finite. As $\Psi(x_k)$ is finite for an arbitrary finite $x_k$, it follows that $\Phi_{i+1}(x_k)$ is finite for all $i = 0, 1, 2, \ldots$. Hence, $\lim_{i \rightarrow \infty} \Phi_i(x_k)$ is finite, which means $\Phi_{i+1}(x_k) = \Phi_i(x_k)$ as $i \rightarrow \infty$.
On the other hand, if the limit of $\Phi_i(x_k)$ exists as $i \rightarrow \infty$, then according to (42)-(44) and the finiteness of $\Psi(x_k)$, we can conclude that $\sum_{j=0}^{\infty} U(x_{k+j}, \mu(x_{k+j}))$ is finite. Since the utility function $U(x_k, u_k)$ is positive definite for all $x_k, u_k$, we obtain $U(x_{k+j}, \mu(x_{k+j})) \rightarrow 0$ as $j \rightarrow \infty$. As $\mu(x_k) = 0$ for $x_k = 0$, we get $x_k \rightarrow 0$ as $k \rightarrow \infty$, which means that (1) is stable and $\mu(x_k)$ is an admissible control law. The necessity of the statement is proven and the proof is completed.
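A minimal numerical sketch of the admissibility test implied by Theorem 3.3 and (40) is given below, assuming a hypothetical scalar system and two candidate control laws. With $\Psi(x_k) \equiv 0$, $\Phi_i(x_k)$ is simply the partial accumulated cost along the closed-loop trajectory, so $\mu$ is judged admissible when the partial sums settle to a finite limit.

```python
import numpy as np

# Hypothetical system and utility (for illustration only).
F = lambda x, u: 0.9 * x + 0.5 * u            # x_{k+1} = F(x_k, u_k)
U = lambda x, u: x**2 + u**2                  # positive definite utility

def phi_sequence(mu, x0, i_max=200):
    """Iterate (40) with Psi = 0: Phi_{i+1}(x0) = sum_{j<=i} U(x_j, mu(x_j))."""
    phi, x, total = [], x0, 0.0
    for _ in range(i_max):
        u = mu(x)
        total += U(x, u)
        phi.append(total)
        x = F(x, u)
    return np.array(phi)

def is_admissible(mu, x0, tol=1e-8):
    phi = phi_sequence(mu, x0)
    return abs(phi[-1] - phi[-2]) < tol       # limit of Phi_i exists (numerically)

mu_stab = lambda x: -0.8 * x                  # stabilizing candidate (closed-loop gain 0.5)
mu_bad = lambda x: 0.5 * x                    # destabilizing candidate (closed-loop gain 1.15 > 1)
print(is_admissible(mu_stab, 1.0), is_admissible(mu_bad, 1.0))
```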

Algorithm 2 Discrete-Time Policy Iteration Algorithm

Initialization:

Choose randomly an array of initial states $x_0$;

Choose a computation precision $\varepsilon$;

Give the initial admissible control law $v_0(x_k)$;

Give the maximum number of iterations $i_{\max}$.

Iteration:
Let the iteration index i = 0 i = 0 i=0i=0;
让迭代索引 i = 0 i = 0 i=0i=0 ;

Construct the iterative performance index function $V_0(x_k)$ according to $v_0(x_k)$ by
$$V_0(x_k) = U(x_k, v_0(x_k)) + V_0(x_{k+1})$$
2: Update the iterative control law by
$$v_1(x_k) = \arg\min_{u_k}\left\{U(x_k, u_k) + V_0(x_{k+1})\right\}$$
3: Let $i = i + 1$. Construct the iterative performance index function $V_i(x_k)$, which satisfies the following GHJB equation:
$$V_i(x_k) = U(x_k, v_i(x_k)) + V_i(F(x_k, v_i(x_k)))$$
4: Update the iterative control law $v_{i+1}(x_k)$ by
$$v_{i+1}(x_k) = \arg\min_{u_k}\left\{U(x_k, u_k) + V_i(x_{k+1})\right\}$$
5: If $V_{i-1}(x_k) - V_i(x_k) < \varepsilon$, goto Step 7. Else goto Step 6;
6: If $i < i_{\max}$, then goto Step 3. Else goto Step 8;

7: return $v_i(x_k)$ and $V_i(x_k)$. The optimal control law is achieved;

8: return The optimal control law is not achieved within $i_{\max}$ iterations.
According to Theorem 3.3, we can establish an effective iteration algorithm by repeating experiments using neural networks. The detailed implementation of the iteration algorithm is expressed in Algorithm 1.
Remark 3.3: We can see that the previously outlined training procedure can easily be implemented by a computer program. After creating an action network, the weights of the critic networks cnet 1 and cnet 2 can be updated iteratively. If the weights of the critic network are convergent, then $\mu(x_k)$ must be an admissible control law. As the weights of the action network are chosen randomly, the admissibility of the control law is unknown before the weights of the critic networks converge. Hence, the iteration process of Algorithm 1 is implemented offline. For convenience of implementation, in this paper, we choose $\Psi(x_k) \equiv 0$ to obtain the initial admissible control law. The detailed training method of the neural networks is shown in the following section.

D. Summary of the Policy Iteration Algorithm of ADP

According to the above preparations, we can summarize the discrete-time policy iteration algorithm in Algorithm 2.
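The following is a minimal sketch of Algorithm 2 for a hypothetical scalar system, not the neural-network implementation of Section IV: policy evaluation approximates $V_i(x_k)$ by a finite-horizon sum of utilities along the closed-loop trajectory (the same device used for the critic targets in Section IV), and policy improvement minimizes $U(x_k, u_k) + V_i(x_{k+1})$ over a discretized control set. The system map, utility, grids, and rollout length are all illustrative assumptions.

```python
import numpy as np

# Hypothetical scalar system and quadratic utility (for illustration only).
F = lambda x, u: 0.9 * x + 0.5 * u
U = lambda x, u: x**2 + 0.5 * u**2

xs = np.linspace(-2.0, 2.0, 201)              # state grid
us = np.linspace(-2.0, 2.0, 201)              # discretized control set

def evaluate(policy_vals, N=200):
    """Policy evaluation: V_i(x) ~ finite-horizon sum of utilities under v_i, rolled out on the grid."""
    V = np.zeros_like(xs)
    x = xs.copy()
    for _ in range(N):
        u = np.interp(x, xs, policy_vals)     # v_i(x) by interpolation
        V += U(x, u)
        x = np.clip(F(x, u), xs[0], xs[-1])
    return V

def improve(V):
    """Policy improvement: v_{i+1}(x) = argmin_u { U(x,u) + V_i(F(x,u)) }."""
    x_next = np.clip(F(xs[:, None], us[None, :]), xs[0], xs[-1])
    q = U(xs[:, None], us[None, :]) + np.interp(x_next, xs, V)
    return us[np.argmin(q, axis=1)]

policy = -0.5 * xs                            # initial admissible control law v_0(x) = -0.5 x
for i in range(6):
    V = evaluate(policy)                      # policy evaluation (Steps 1/3 of Algorithm 2)
    policy = improve(V)                       # policy improvement (Steps 2/4 of Algorithm 2)
print("approximate optimal control at x = 1:", np.interp(1.0, xs, policy))
```

The grids and rollout length are only for illustration; the implementation developed in Section IV replaces them with critic and action neural networks.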

Fig. 1. Structure diagram of the algorithm.

IV. NEURAL NETWORK IMPLEMENTATION FOR THE OPTIMAL CONTROL SCHEME

In this paper, BP neural networks are used to approximate $v_i(x_k)$ and $V_i(x_k)$, respectively. The number of hidden layer neurons is denoted by $l$, the weight matrix between the input layer and the hidden layer is denoted by $Y$, and the weight matrix between the hidden layer and the output layer is denoted by $W$. Then, the output of the three-layer neural network is represented by
$$\hat{F}(X, Y, W) = W^T \sigma(Y^T X + b)$$
where $\sigma(Y^T X) \in \mathbb{R}^l$, $[\sigma(z)]_i = \left(e^{z_i} - e^{-z_i}\right)/\left(e^{z_i} + e^{-z_i}\right)$, $i = 1, \ldots, l$, are the activation functions and $b$ is the threshold value.
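A minimal numpy sketch of the three-layer network output (45), with the hyperbolic tangent activation defined above (all sizes and the random initialization are illustrative assumptions):

```python
import numpy as np

n, l, m = 2, 8, 1                      # input, hidden, and output dimensions (illustrative)
rng = np.random.default_rng(0)
Y = rng.standard_normal((n, l)) * 0.1  # input-to-hidden weight matrix
W = rng.standard_normal((l, m)) * 0.1  # hidden-to-output weight matrix
b = np.zeros(l)                        # threshold value

def sigma(z):
    # [sigma(z)]_i = (e^{z_i} - e^{-z_i}) / (e^{z_i} + e^{-z_i}), i.e., tanh
    return np.tanh(z)

def net_output(X):
    # F_hat(X, Y, W) = W^T sigma(Y^T X + b)
    return W.T @ sigma(Y.T @ X + b)

print(net_output(np.array([1.0, -1.0])))
```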
Here, there are two networks, namely the critic network and the action network. Both neural networks are chosen as three-layer feedforward neural networks. The whole structure diagram is shown in Fig. 1.

A. Critic Network

The critic network is used to approximate the performance index function $V_i(x_k)$. The output of the critic network is denoted as
$$\hat{V}_i^j(x_k) = W_{ci}^{jT} \sigma(Z_c(k))$$
where $Z_c(k) = Y_c^T x_k + b_c$. The target function is expressed in (9). Then, we define the error function for the critic network as
$$e_{ci}^j(k) = \hat{V}_i^j(x_k) - V_i(x_k)$$
The objective function to be minimized in the critic network training is
$$E_{ci}^j(k) = \frac{1}{2}\left(e_{ci}^j(k)\right)^2$$
Therefore, the gradient-based weight update rule [24] for the critic network is given by
$$\begin{aligned} W_{ci}^{j+1}(k) &= W_{ci}^j(k) + \Delta W_{ci}^j(k) \\ &= W_{ci}^j(k) - \alpha_c\left[\frac{\partial E_{ci}^j(k)}{\partial \hat{V}_{i+1}^j(x_k)} \frac{\partial \hat{V}_{i+1}^j(x_k)}{\partial W_{ci}^j(k)}\right] \\ &= W_{ci}^j(k) - \alpha_c e_{ci}^j(k) \sigma(Z_c(k)) \end{aligned}$$
where $\alpha_c > 0$ is the learning rate of the critic network. If the training precision is achieved, then we say that $V_{i+1}(x_k)$ can be approximated by the critic network.
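A sketch of one gradient step of the critic update rule (49) in numpy, continuing the notation of (46)-(48); the target value `V_target` stands for the critic training target and is assumed to be given:

```python
import numpy as np

def critic_update(W_c, Y_c, b_c, x_k, V_target, alpha_c=0.02):
    """One gradient step W_c <- W_c - alpha_c * e * sigma(Z_c), cf. (46)-(49)."""
    Z_c = Y_c.T @ x_k + b_c
    sig = np.tanh(Z_c)
    V_hat = W_c @ sig                      # critic output (scalar); W_c has shape (l,)
    e = V_hat - V_target                   # error between critic output and its target
    W_c = W_c - alpha_c * e * sig          # only the output-layer weights are updated
    return W_c, e
```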
There is an important property we should point out. Usually, the iterative performance index function $V_i(x_k)$ is difficult to obtain directly as the critic network is not trained yet. In this situation, there are two methods to obtain the target function of $V_i(x_k)$. First, define a new iteration index $l = 0, 1, \ldots$ and let the iterative performance index function be expressed as $V_{i,l}(x_k)$. Then, let $V_{i,l}(x_k)$ be updated by
$$V_{i,l+1}(x_k) = U(x_k, v_i(x_k)) + \hat{V}_{i,l}(x_{k+1})$$
where $\hat{V}_{i,0}(x_{k+1}) = V_{i-1}(x_{k+1})$ is the output of the critic network in the previous iteration. The error function of the critic neural network can be expressed as
$$e_{ci,l}(k) = \hat{V}_{i,l+1}(x_k) - V_{i,l+1}(x_k)$$
Update the neural network according to (48) and (49) until $e_{ci,l}(k) \rightarrow 0$. Then let $l = l + 1$ and repeat (50) and (51), until the following holds:
$$V_{i,l}(x_k) = U(x_k, v_i(x_k)) + V_{i,l}(x_{k+1})$$
Then, we have $V_i(x_k) = V_{i,l}(x_k)$. This method is simple. However, (52) usually holds only for $l \rightarrow \infty$, i.e.,
$$V_{i,\infty}(x_k) = U(x_k, v_i(x_k)) + V_{i,\infty}(x_{k+1})$$
Thus, the amount of computation is very large and it takes a long time for the weights of the critic network to converge.
The other method can be expressed as follows. According to (9), we have
$$V_i(x_k) = \sum_{j=0}^{N-1} U(x_{k+j}, v_i(x_{k+j})) + V_i(x_{k+N})$$
As $v_i(x_k)$ is an admissible control law, we have $V_i(x_{k+N}) \rightarrow 0$ as $N \rightarrow \infty$. Thus, choosing a large enough $N$, we have
$$V_i(x_k) \doteq \sum_{j=0}^{N-1} U(x_{k+j}, v_i(x_{k+j}))$$
Then, the error function (47) can be obtained. We can see that this method is much easier to apply. For convenience of computation, the second method is used in this paper.
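As a small illustration of this second method, the critic training target for a state $x_k$ can be generated by rolling out the current iterative control law for $N$ steps and summing the utilities; `F`, `U`, and `v_i` below are placeholders for the system map, the utility function, and the current control law.

```python
def critic_target(F, U, v_i, x_k, N=100):
    """Approximate V_i(x_k) by the finite-horizon sum of utilities under v_i."""
    total, x = 0.0, x_k
    for _ in range(N):
        u = v_i(x)
        total += U(x, u)
        x = F(x, u)          # x_{k+j+1} = F(x_{k+j}, v_i(x_{k+j}))
    return total
```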

B. Action Network

In the action network, the state error $x_k$ is used as the input to create the iterative control law as the output of the network. The output can be formulated as
$$\hat{v}_i^j(x_k) = W_{ai}^{jT} \sigma(Z_a(k))$$
where $Z_a(k) = Y_a^T x_k + b_a$. The target of the output of the action network is given by (10). Thus, we can define the output error of the action network as
$$e_{ai}^j(k) = \hat{v}_i^j(x_k) - v_i(x_k)$$
The weights of the action network are updated to minimize the following performance error measure:
$$E_{ai}^j(k) = \frac{1}{2}\left(e_{ai}^j(k)\right)^T\left(e_{ai}^j(k)\right)$$
The weight updating algorithm is similar to the one for the critic network. By the gradient descent rule, we can obtain
$$\begin{aligned} W_{ai}^{j+1}(k) &= W_{ai}^j(k) + \Delta W_{ai}^j(k) \\ &= W_{ai}^j(k) - \beta_a\left[\frac{\partial E_{ai}^j(k)}{\partial e_{ai}^j(k)} \frac{\partial e_{ai}^j(k)}{\partial \hat{v}_i^j(k)} \frac{\partial \hat{v}_i^j(k)}{\partial W_{ai}^j(k)}\right] \\ &= W_{ai}^j(k) - \beta_a \sigma(Z_a(k))\left(e_{ai}^j(k)\right)^T \end{aligned}$$
where $\beta_a > 0$ is the learning rate of the action network. If the training precision is achieved, then we say that the iterative control law $v_i(x_k)$ can be approximated by the action network.
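Similarly, a sketch of one gradient step of the action network update rule (58); the target control `v_target` stands for the desired iterative control law value and is assumed to be given.

```python
import numpy as np

def action_update(W_a, Y_a, b_a, x_k, v_target, beta_a=0.02):
    """One gradient step W_a <- W_a - beta_a * sigma(Z_a) * e^T for the action network."""
    Z_a = Y_a.T @ x_k + b_a
    sig = np.tanh(Z_a)                       # shape (l,)
    v_hat = W_a.T @ sig                      # action output; W_a has shape (l, m)
    e = v_hat - v_target                     # output error, shape (m,)
    W_a = W_a - beta_a * np.outer(sig, e)    # sigma(Z_a) e^T
    return W_a, e
```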
In this paper, to enhance the convergence speed of the neural networks, only one layer of each neural network is updated during the training procedure. To guarantee the effectiveness of the neural network implementation, the convergence of the neural network weights is proven, which ensures that the iterative performance index function and the iterative control law can be approximated by the critic and action networks, respectively. The weight convergence property of the neural networks is shown in the following theorem.
Theorem 4.1: Let the target performance index function and the target iterative control law be expressed by
$$V_{i+1}(x_k) = W_{ci}^{*T} \sigma(Z_c(k))$$
and 
$$v_i(x_k) = W_{ai}^{*T} \sigma(Z_a(k))$$
respectively. Let the critic and action networks be trained by (49) and (58), respectively. If the learning rates $\alpha_c$ and $\beta_a$ are both small enough, then the critic network weights $W_{ci}(k)$ and the action network weights $W_{ai}(k)$ are asymptotically convergent to the optimal weights $W_{ci}^*(k)$ and $W_{ai}^*(k)$, respectively.
Proof: Let $\bar{W}_{ci}^j = W_{ci}^j - W_{ci}^*$ and $\bar{W}_{ai}^j = W_{ai}^j - W_{ai}^*$. From (49) and (58), we have
$$\bar{W}_{ci}^{j+1}(k) = \bar{W}_{ci}^j(k) - \alpha_c e_{ci}^j(k) \sigma(Z_c(k))$$
and 
$$\bar{W}_{ai}^{j+1}(k) = \bar{W}_{ai}^j(k) - \beta_a e_{ai}^j(k) \sigma(Z_a(k))$$
Consider the following Lyapunov function candidate:
$$L\left(\bar{W}_{ci}^j, \bar{W}_{ai}^j\right) = \operatorname{tr}\left\{\bar{W}_{ci}^{jT} \bar{W}_{ci}^j + \bar{W}_{ai}^{jT} \bar{W}_{ai}^j\right\}$$
Then, the difference of the Lyapunov function candidate (61) is given by
$$\begin{aligned} \Delta L\left(\bar{W}_{ci}^j, \bar{W}_{ai}^j\right) &= \operatorname{tr}\left\{\bar{W}_{ci}^{(j+1)T} \bar{W}_{ci}^{j+1} + \bar{W}_{ai}^{(j+1)T} \bar{W}_{ai}^{j+1}\right\} - \operatorname{tr}\left\{\bar{W}_{ci}^{jT} \bar{W}_{ci}^j + \bar{W}_{ai}^{jT} \bar{W}_{ai}^j\right\} \\ &= \alpha_c\left\|e_{ci}^j(k)\right\|^2\left(-2 + \alpha_c\left\|\sigma(Z_c(k))\right\|^2\right) + \beta_a\left\|e_{ai}^j(k)\right\|^2\left(-2 + \beta_a\left\|\sigma(Z_a(k))\right\|^2\right) \end{aligned}$$
According to the definition of $\sigma(\cdot)$ in (45), we know that $\left\|\sigma(Z_c(k))\right\|^2$ and $\left\|\sigma(Z_a(k))\right\|^2$ are both finite for all $Z_c(k), Z_a(k)$. Thus, if $\alpha_c$ and $\beta_a$ are both small enough to satisfy $\alpha_c \leq 2/\left\|\sigma(Z_c(k))\right\|^2$ and $\beta_a \leq 2/\left\|\sigma(Z_a(k))\right\|^2$, then we have $\Delta L\left(\bar{W}_{ci}^j, \bar{W}_{ai}^j\right) < 0$. The proof is completed.
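One conservative reading of this condition: since each component of $\sigma(\cdot)$ is a hyperbolic tangent and hence bounded by 1 in magnitude, $\left\|\sigma(Z_c(k))\right\|^2 < l$ and $\left\|\sigma(Z_a(k))\right\|^2 < l$, where $l$ is the number of hidden neurons. Therefore the choices
$$\alpha_c \leq \frac{2}{l}, \qquad \beta_a \leq \frac{2}{l}$$
are sufficient for $\Delta L(\bar{W}_{ci}^j, \bar{W}_{ai}^j) < 0$ whenever the errors are nonzero. For instance, with $l = 8$ hidden neurons, any learning rate not larger than $0.25$ (such as the value $0.02$ used in the examples of Section V) satisfies the condition.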

V. NUMERICAL ANALYSIS

To evaluate the performance of our policy iteration algorithm, four examples have been chosen: 1) a linear system; 2) a discrete-time nonlinear system; 3) a torsional pendulum system; and 4) a complex nonlinear system to show its broad applicability.
Example 1: In the first example, a linear system is considered. The results will be compared with the traditional linear quadratic regulation method to verify the effectiveness of the developed algorithm. We consider the following linear system:
$$x_{k+1} = A x_k + B u_k$$
where $x_k = [x_{1k}, x_{2k}]^T$ and $u_k \in \mathbb{R}^1$. Let the system matrices be expressed as
$$A = \begin{bmatrix} 0 & 0.1 \\ 0.3 & -1 \end{bmatrix}, \quad B = \begin{bmatrix} 0 \\ 0.5 \end{bmatrix}$$
The initial state is $x_0 = [1, -1]^T$. Let the performance index function be expressed by (2). The utility function is the quadratic form $U(x_k, u_k) = x_k^T Q x_k + u_k^T R u_k$, where $Q = I$, $R = 0.5 I$, and $I$ is the identity matrix with suitable dimensions.
Neural networks are used to implement the developed policy iteration algorithm. The critic network and the action network are chosen as three-layer BP neural networks with structures 2-8-1 and 2-8-1, respectively. For each iteration step, the critic network and the action network are trained for 80 steps using the learning rate $\alpha = 0.02$, so that the neural network training error becomes less than $10^{-5}$. The initial admissible control law is obtained by Algorithm 1, where the weights and thresholds of the action network are obtained as
$$Y_{a,\text{initial}} = \begin{bmatrix} -4.1525 & -1.1980 \\ 0.3693 & -0.8828 \\ 1.8071 & 2.8088 \\ 0.4104 & -0.9845 \\ 0.7319 & -1.7384 \\ 1.2885 & -2.5911 \\ -0.3403 & 0.8154 \\ -0.5647 & 1.3694 \end{bmatrix}$$
$$W_{a,\text{initial}} = [-0.0010, -0.2566, 0.0001, -0.1409, -0.0092, 0.0001, 0.3738, 0.0998]$$
$$b_{a,\text{initial}} = [3.5272, -0.9609, -1.8038, -0.0970, 0.8526, 1.1966, -1.0948, 2.5641]^T$$
respectively. The policy iteration algorithm is implemented for six iterations to reach the computation precision $\varepsilon = 10^{-5}$, and the convergence trajectory of the iterative performance index functions is shown in Fig. 2(a). During each iteration, the iterative control law is updated. Applying the iterative control law to the given system (62) for $T_f = 15$ time steps, we can obtain the iterative states and iterative controls, which are shown in Fig. 2(b) and (c), respectively.
We can see that the optimal control law of (62) is obtained after six iterations. On the other hand, it is known that the solution of the optimal control problem for the linear system

Fig. 2. Numerical results of Example 1. (a) Convergence trajectory of the iterative performance index function. (b) Iterative states. (c) Iterative controls. (d) Trajectories of $x_2$ under the optimal control of policy iteration and ARE.

is quadratic in the state and given as $J^*(x_k) = x_k^T P x_k$, where $P$ is the solution of the ARE. The solution of the ARE for the given linear system (62) is
$$P = \begin{bmatrix} 1.091 & -0.309 \\ -0.309 & 2.055 \end{bmatrix}$$
The optimal control can be obtained as $u^*(x_k) = [-0.304, 1.029] x_k$. Applying this optimal control law to the linear system (62), we obtain the same optimal control results. The optimal trajectories of the system state $x_2$ under the optimal control laws by policy iteration and the ARE are shown in Fig. 2(d).
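The ARE solution above can be checked numerically; a minimal sketch iterating the discrete-time Riccati recursion for the given $A$, $B$, $Q$, and $R$ is shown below, and it should converge to approximately the $P$ and the feedback gain reported above.

```python
import numpy as np

A = np.array([[0.0, 0.1], [0.3, -1.0]])
B = np.array([[0.0], [0.5]])
Q = np.eye(2)
R = 0.5 * np.eye(1)

# Fixed-point iteration of the discrete-time algebraic Riccati equation.
P = Q.copy()
for _ in range(500):
    K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)   # gain: u = -K x
    P_next = A.T @ P @ A - A.T @ P @ B @ K + Q
    if np.max(np.abs(P_next - P)) < 1e-12:
        break
    P = P_next

print("P approx\n", P)
print("u*(x) =", -K, "x")   # expected to be close to [-0.304, 1.029]
```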
Example 2: The second example is chosen from the examples in [32], [39], and [40]. We consider the following nonlinear system:
$$x_{k+1} = f(x_k) + g(x_k) u_k$$
where $x_k = [x_{1k}\;\; x_{2k}]^T$ and $u_k$ are the state and control variables, respectively. The system functions are given as
$$f(x_k) = \begin{bmatrix} 0.2 x_{1k} \exp\left(x_{2k}^2\right) \\ 0.3 x_{2k}^3 \end{bmatrix}, \quad g(x_k) = \begin{bmatrix} 0 \\ -0.2 \end{bmatrix}$$
The initial state is $x_0 = [2, -1]^T$. In this example, two utility functions, of quadratic and nonquadratic forms, will be considered. The first utility function is a quadratic form, the same as the one in Example 1 with $Q = R = I$. The configurations of the critic network and the action network are chosen the same as in Example 1. For each iteration step, the critic network and the action network are trained for 100 steps using the learning rate $\alpha = 0.02$, so that the neural network training error becomes less than $10^{-5}$. The policy iteration algorithm is implemented for six iterations to reach the computation precision $10^{-5}$. The convergence trajectories of the iterative performance index functions are shown in Fig. 3(a). Applying the optimal control law to the given system (63) for $T_f = 10$ time steps, we can obtain the iterative state trajectories and control, which are shown in Fig. 3(b)-(d), respectively.
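For reference, the Example 2 dynamics and quadratic utility can be written compactly as follows; this is a plain restatement of (63) and the cost used above, not the neural-network implementation.

```python
import numpy as np

def f(x):
    # f(x_k) from (63)
    return np.array([0.2 * x[0] * np.exp(x[1] ** 2), 0.3 * x[1] ** 3])

def g(x):
    # g(x_k) from (63)
    return np.array([0.0, -0.2])

def step(x, u):
    # x_{k+1} = f(x_k) + g(x_k) u_k (scalar control)
    return f(x) + g(x) * u

def utility(x, u, Q=np.eye(2), R=np.eye(1)):
    # U(x_k, u_k) = x_k^T Q x_k + u_k^T R u_k with Q = R = I
    return float(x @ Q @ x + u * R[0, 0] * u)

x = np.array([2.0, -1.0])          # initial state from the example
print(step(x, 0.0), utility(x, 0.0))
```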

Fig. 3. Numerical results using the policy iteration algorithm. (a) Convergence trajectory of the iterative performance index function. (b) Optimal trajectory of state $x_1$. (c) Optimal trajectory of state $x_2$. (d) Optimal control.

Fig. 4. Numerical results using the value iteration algorithm. (a) Convergence trajectory of the iterative performance index function. (b) Optimal trajectory of state $x_1$. (c) Optimal trajectory of state $x_2$. (d) Optimal control.
In [39], it is proven that the optimal control law can be obtained by the value iteration algorithm (39). The convergence trajectory of the iterative performance index function is shown in Fig. 4(a). The optimal trajectories of the system states and control are shown in Fig. 4(b)-(d), respectively.
Next, we change the utility function into a nonquadratic form, modified from [34], where the utility function is expressed as
$$U(x_k, u_k) = \ln\left(x_k^T Q x_k + 1\right) + \ln\left(x_k^T Q x_k + 1\right) u_k^T R u_k$$
Let the other parameters remain unchanged. Using the developed policy iteration algorithm, we can also obtain the optimal control law of the system. The performance index function is shown in Fig. 5(a). The optimal trajectories of the iterative states and controls are shown in Fig. 5(b)-(d), respectively.
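The nonquadratic utility above factors as $\ln(x_k^T Q x_k + 1)\,(1 + u_k^T R u_k)$; a direct transcription for reference:

```python
import numpy as np

def utility_nonquadratic(x, u, Q=np.eye(2), R=np.eye(1)):
    # U(x_k, u_k) = ln(x^T Q x + 1) + ln(x^T Q x + 1) * u^T R u
    u = np.atleast_1d(u)
    return float(np.log(x @ Q @ x + 1.0) * (1.0 + u @ R @ u))

print(utility_nonquadratic(np.array([2.0, -1.0]), 0.5))
```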

Fig. 5. Numerical results for the nonquadratic utility function using the policy iteration algorithm. (a) Convergence trajectory of the iterative performance index function. (b) Iterative states. (c) Iterative controls. (d) Optimal states and control.
Remark 5.1: From the numerical results, we can see that the optimal control law of the nonlinear system can be effectively obtained for both quadratic and nonquadratic utility functions. On the other hand, we have shown that using the value iteration algorithm in [39], we can also obtain the optimal control law of the system. However, we should point out that the convergence properties of the iterative performance index functions under the policy and value iteration algorithms are obviously different. This makes the stability of the control system under the iterative control laws quite different. In the following example, detailed comparisons will be displayed.
Example 3: We now examine the performance of the developed algorithm on a torsional pendulum system [24]. The dynamics of the pendulum are as follows:
$$\left\{\begin{aligned} \frac{d\theta}{dt} &= \omega \\ J\frac{d\omega}{dt} &= u - Mgl\sin\theta - f_d\frac{d\theta}{dt} \end{aligned}\right.$$
where $M = 1/3\,\mathrm{kg}$ and $l = 2/3\,\mathrm{m}$ are the mass and length of the pendulum bar, respectively. The system states are the current angle $\theta$ and the angular velocity $\omega$. Let $J = 4/3\,Ml^2$ and $f_d = 0.2$ be the rotary inertia and frictional factor, respectively. Let $g = 9.8\,\mathrm{m/s^2}$ be the gravity. Discretization of the system function and performance index function using the Euler and trapezoidal methods [48] with the sampling interval $\Delta t = 0.1\,\mathrm{s}$ leads to
$$\begin{bmatrix} x_{1(k+1)} \\ x_{2(k+1)} \end{bmatrix} = \begin{bmatrix} 0.1 x_{2k} + x_{1k} \\ -0.49\sin(x_{1k}) - 0.1 f_d x_{2k} + x_{2k} \end{bmatrix} + \begin{bmatrix} 0 \\ 0.1 \end{bmatrix} u_k$$
where $x_{1k} = \theta_k$ and $x_{2k} = \omega_k$. Let the initial state be $x_0 = [1, -1]^T$. Let the utility function be the quadratic form,

Fig. 6. Numerical results of Example 3 using the policy iteration algorithm. (a) Convergence trajectory of the iterative performance index function. (b) Iterative controls. (c) Iterative states. (d) Optimal states and control.

and the structures of the critic and action networks are 2-12-1 and 2-12-1. The initial admissible control law is obtained by Algorithm 1, where the weight matrices are obtained as
$$Y_{a,\text{initial}} = \begin{bmatrix} -0.6574 & 1.1803 \\ 1.5421 & -2.9447 \\ -4.3289 & -3.7448 \\ 5.7354 & 2.8933 \\ 0.4354 & -0.8078 \\ -1.9680 & 3.6870 \\ 1.9285 & 1.4044 \\ -4.9011 & -4.3527 \\ 1.0914 & -0.0344 \\ -1.5746 & 2.8033 \\ 1.4897 & -0.0315 \\ 0.2992 & -0.0784 \end{bmatrix}$$
$$W_{a,\text{initial}} = [-2.1429, 1.9276, 0.0060, 0.0030, 4.5618, 3.2266, 0.0005, -0.0012, 1.3796, -0.5338, 0.5043, -5.1110]$$
For each iteration step, the critic and action networks are trained for 400 steps using the learning rate $\alpha=0.02$ so that the training error reaches $10^{-5}$. The policy iteration algorithm is implemented for 16 iterations to reach the computation precision $10^{-5}$, and the convergence trajectory of the iterative performance index function is shown in Fig. 6(a). Applying the iterative control laws to the given system for $T_f=100$ time steps, the trajectories of the iterative controls and states are shown in Fig. 6(b) and (c), respectively. The optimal trajectories of the system states and control are shown in Fig. 6(d).
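The structure of the iteration just described is summarized in the sketch below. It is a simplified, tabular stand-in for the neural-network implementation (a coarse state grid replaces the critic and action networks), intended only to show the policy evaluation and policy improvement steps; `step`, `utility`, the grids, and the iteration counts are placeholders.

```python
import itertools
import numpy as np

def policy_iteration(step, utility, x_grid, u_grid, policy0, n_iter=16, n_eval=400):
    """Tabular sketch of discrete-time policy iteration (didactic, not the paper's code).

    step(x, u) -> next state; utility(x, u) -> stage cost; policy0(x) -> u must be
    an admissible initial control law (cf. Algorithm 1), otherwise the policy
    evaluation step need not converge.
    """
    states = list(itertools.product(*x_grid))
    snap = lambda x: tuple(min(g, key=lambda v: abs(v - xi)) for xi, g in zip(x, x_grid))

    policy = {s: policy0(np.array(s)) for s in states}
    V = {s: 0.0 for s in states}

    for _ in range(n_iter):
        # Policy evaluation: V_i(x_k) = U(x_k, v_i(x_k)) + V_i(x_{k+1})
        for _ in range(n_eval):
            for s in states:
                u = policy[s]
                V[s] = utility(np.array(s), u) + V[snap(step(np.array(s), u))]
        # Policy improvement: v_{i+1}(x_k) = argmin_u { U(x_k, u) + V_i(x_{k+1}) }
        for s in states:
            policy[s] = min(u_grid,
                            key=lambda u: utility(np.array(s), u) + V[snap(step(np.array(s), u))])
    return policy, V
```

A grid-based evaluation like this only illustrates the two alternating steps; in the paper the critic and action networks play the roles of `V` and `policy`, and the 400 training steps per iteration correspond to the inner evaluation sweeps.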
From the numerical results, we can see that, using the developed policy iteration algorithm, any of the iterative control laws

Fig. 7. Numerical results of Example 3 using value iteration algorithm. (a) Convergence trajectory of iterative performance index function. (b) Iterative controls. (c) Iterative states. (d) Optimal states and control.

stabilizes the system. However, for the value iteration algorithm, the situation is quite different: the initial performance index function is $V_0(x_k)\equiv 0$. Running the value iteration algorithm (39) for 30 iterations to reach the computation precision $10^{-5}$, the trajectory of the performance index function is shown in Fig. 7(a). Applying the iterative control law to the given system (66) for $T_f=100$ time steps, we obtain the iterative states and iterative controls, which are shown in Fig. 7(b) and (c), respectively. The optimal trajectories of the control and system states are shown in Fig. 7(d).
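For reference, the recursion labeled (39) earlier in the paper is the value iteration update; the form shown here is the standard one, with $x_{k+1}=F(x_k,u_k)$ denoting the system function, and is given only as a reminder:
$$
V_{i+1}(x_k)=\min_{u_k}\left\{U(x_k,u_k)+V_i\big(F(x_k,u_k)\big)\right\},\qquad V_0(x_k)\equiv 0 .
$$
This recursion updates the performance index function directly rather than evaluating a fixed admissible control law, which is why the intermediate value-iteration control laws need not be stabilizing.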
For unstable control systems, although the optimal control law can be obtained by both the value and policy iteration algorithms, not all of the iterative controls produced by the value iteration algorithm stabilize the control system. Moreover, the properties of the iterative controls obtained by the value iteration algorithm cannot be analyzed, which restricts the value iteration algorithm to offline implementation. For the developed policy iteration algorithm, the stability property is guaranteed. Hence, these comparisons demonstrate the effectiveness of the policy iteration algorithm developed in this paper.
Example 4: As a real-world application of the developed method, the problem of nonlinear satellite attitude control is considered [49], [50]. The satellite dynamics are represented as
$$
\frac{d \omega}{d t}=\Upsilon^{-1}\left(N_{\mathrm{net}}-\omega \times \Upsilon \omega\right)
$$
where $\Upsilon$, $\omega$, and $N_{\text{net}}$ are the inertia tensor, the angular velocity vector of the body frame with respect to the inertial frame, and the vector of the total torque applied on the satellite, respectively. The selected satellite is an inertial pointing satellite. Hence, we are interested in its attitude with respect to the inertial frame. All the vectors are represented in the body frame, and the sign $\times$ denotes the cross product of two vectors. Let $N_{\text{net}}=N_{\text{ctrl}}+N_{\text{dis}}$, where $N_{\text{ctrl}}$ is the control and $N_{\text{dis}}$ is the disturbance. Following [50] and its order of transformation, the kinematic

Fig. 8. Convergence trajectories of the critic network.

equation of the satellite is
$$
\frac{d}{d t}\left[\begin{array}{l}
\phi \\
\theta \\
\Psi
\end{array}\right]=\left[\begin{array}{ccc}
1 & \sin (\phi) \tan (\theta) & \cos (\phi) \tan (\theta) \\
0 & \cos (\phi) & -\sin (\phi) \\
0 & \sin (\phi) / \cos (\theta) & \cos (\phi) / \cos (\theta)
\end{array}\right]\left[\begin{array}{l}
\omega_{x} \\
\omega_{y} \\
\omega_{z}
\end{array}\right]
$$
where $\phi$, $\theta$, and $\Psi$ are the three Euler angles describing the attitude of the satellite with respect to the $x$-, $y$-, and $z$-axes of the inertial coordinate system, respectively, and the subscripts $x$, $y$, and $z$ denote the corresponding elements of the angular velocity vector $\omega$. The three Euler angles and the three elements of the angular velocity constitute the state vector of the satellite attitude control problem and lead to the following state equation:
$$
\dot{x}=f(x)+g(x) u
$$
where
and $M_{3 \times 1}$ denotes the right-hand side of (68). The three-by-three null matrix is denoted by $0_{3 \times 3}$. The moment of inertia matrix of the satellite is chosen as
$$
\Upsilon=\left[\begin{array}{ccc}
100 & 2 & 0.5 \\
2 & 100 & 1 \\
0.5 & 1 & 110
\end{array}\right] \mathrm{kg} \cdot \mathrm{m}^{2}
$$
The initial states are $60^{\circ}$, $20^{\circ}$, and $70^{\circ}$ for the Euler angles $\phi$, $\theta$, and $\Psi$, respectively, and $-0.001$, $0.001$, and $0.001\,\mathrm{rad/s}$ for the angular rates around the $x$-, $y$-, and $z$-axes, respectively. For convenience of analysis, we assume that there is no disturbance in the system. The system is discretized with a sampling time of $0.25\,\mathrm{s}$ using the Euler method. Let the utility function be of quadratic form, where the state and control weight matrices are selected as $Q=\operatorname{diag}(0.25,0.25,0.25,25,25,25)$ and $R=\operatorname{diag}(25,25,25)$, respectively.
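A simulation model consistent with this description can be assembled as in the sketch below. It is a minimal sketch under stated assumptions: the drift and input terms are taken as $f(x)=[M_{3\times1};\,-\Upsilon^{-1}(\omega\times\Upsilon\omega)]$ and $g(x)=[0_{3\times3};\,\Upsilon^{-1}]$ (their explicit definitions are not reproduced in this snapshot), the disturbance is set to zero as stated, the Euler angles are handled in radians, and Euler discretization with $\Delta t=0.25$ s is applied; all function names are ours.

```python
import numpy as np

Upsilon = np.array([[100.0,   2.0,   0.5],
                    [  2.0, 100.0,   1.0],
                    [  0.5,   1.0, 110.0]])     # inertia matrix (kg*m^2)
Upsilon_inv = np.linalg.inv(Upsilon)
dt = 0.25                                        # sampling time (s)
Q = np.diag([0.25, 0.25, 0.25, 25.0, 25.0, 25.0])
R = np.diag([25.0, 25.0, 25.0])

def euler_rates(angles, omega):
    """Assumed right-hand side M_{3x1}: Euler-angle rates from the body rates."""
    phi, theta, _ = angles
    T = np.array([[1.0, np.sin(phi) * np.tan(theta), np.cos(phi) * np.tan(theta)],
                  [0.0, np.cos(phi),                -np.sin(phi)],
                  [0.0, np.sin(phi) / np.cos(theta), np.cos(phi) / np.cos(theta)]])
    return T @ omega

def satellite_step(x, u):
    """One Euler step of the assumed state equation x_dot = f(x) + g(x) u."""
    angles, omega = x[:3], x[3:]
    f = np.concatenate([euler_rates(angles, omega),
                        -Upsilon_inv @ np.cross(omega, Upsilon @ omega)])
    g = np.vstack([np.zeros((3, 3)), Upsilon_inv])
    return x + dt * (f + g @ u)

def utility(x, u):
    return float(x @ Q @ x + u @ R @ u)

x0 = np.concatenate([np.deg2rad([60.0, 20.0, 70.0]), [-0.001, 0.001, 0.001]])
print(satellite_step(x0, np.zeros(3)), utility(x0, np.zeros(3)))
```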
Neural networks are used to implement the developed policy iteration algorithm. The critic and action networks are chosen as three-layer BP neural networks with the structures of 6-15-1 and 6-15-3, respectively. For each iteration step, the critic and action networks are trained for 800 steps using the learning rate of $\alpha=0.02$ so that the neural network training error becomes less than $10^{-5}$. Implement the policy iteration

Fig. 9. Convergence trajectories of the first column of weights of the action network.

Fig. 10. Numerical results of Example 4 using policy iteration algorithm. (a) Convergence trajectory of iterative performance index function. (b) Trajectories of the Euler angles $\phi$, $\theta$, and $\Psi$. (c) Trajectories of angular velocities $\omega_x$, $\omega_y$, and $\omega_z$. (d) Optimal control trajectories.

algorithm for 150 iterations. We have proven that the weights of the critic and action networks are convergent in each iteration and thus convergent to the optimal ones. The convergence trajectories of the critic network weights are shown in Fig. 8. The weight convergence trajectories of the first column of the action network are shown in Fig. 9.
The iterative performance index functions are shown in Fig. 10(a). After the weights of the critic and action networks are convergent, we apply the neuro-optimal controller to the given system (62) for $T_f=1500$ time steps. The optimal state trajectories of $\phi$, $\theta$, and $\Psi$ are shown in Fig. 10(b). The trajectories of the angular velocities $\omega_x$, $\omega_y$, and $\omega_z$ are shown in Fig. 10(c), and the optimal control trajectories are shown in Fig. 10(d). The numerical results illustrate the effectiveness of the developed policy iteration algorithm.

VI. CONCLUSION

In this paper, an effective policy iteration ADP algorithm is developed to find the infinite horizon optimal control for

discrete-time nonlinear systems. It is shown that any of the iterative control laws can stabilize the control system. It has been proven that the iterative performance index functions are monotonically nonincreasing and convergent to the optimal solution of the HJB equation. Neural networks are used to approximate the performance index function and compute the optimal control policy, respectively, for facilitating the implementation of the iterative ADP algorithm. Finally, four numerical examples are given to illustrate the performance of the developed method.
On the other hand, some further properties of the discrete-time policy iteration algorithm remain to be discussed. In this paper, we assume that the iterative control law $v_i(x_k)$ and the iterative performance index function $V_i(x_k)$ can be well approximated by neural networks. Since neural networks always have approximation errors, the exact $v_i(x_k)$ and $V_i(x_k)$ generally cannot be achieved. In the presence of approximation errors, the convergence of the iterative performance index functions and the stability of the system under the iterative control laws are not yet proven, and additional convergence and stability criteria should be established. The convergence and stability properties of the discrete-time policy iteration algorithm with approximation errors will be discussed in our future work.

REFERENCES

[1] R. E. Bellman, Dynamic Programming. Princeton, NJ, USA: Princeton Univ. Press, 1957.
[2] P. J. Werbos, "Advanced forecasting methods for global crisis warning and models of intelligence," General Syst. Yearbook, vol. 22, pp. 25-38, Jan. 1977.
[3] P. J. Werbos, "A menu of designs for reinforcement learning over time," in Neural Networks for Control, W. T. Miller, R. S. Sutton, and P. J. Werbos, Eds., Cambridge, MA, USA: MIT Press, 1991, pp. 67-95.
[4] T. Dierks and S. Jagannathan, "Online optimal control of affine nonlinear discrete-time systems with unknown internal dynamics by using time-based policy update," IEEE Trans. Neural Netw. Learn. Syst., vol. 23, no. 7, pp. 1118-1129, Jul. 2012.
[5] M. Fairbank, E. Alonso, and D. Prokhorov, "Simple and fast calculation of the second-order gradients for globalized dual heuristic dynamic programming in neural networks," IEEE Trans. Neural Netw. Learn. Syst., vol. 23, no. 10, pp. 1671-1676, Oct. 2012.
[6] Y. Jiang and Z. P. Jiang, "Robust adaptive dynamic programming with an application to power systems," IEEE Trans. Neural Netw. Learn. Syst., vol. 24, no. 7, pp. 1150-1156, Jul. 2013.
[7] D. Liu, Y. Zhang, and H. Zhang, "A self-learning call admission control scheme for CDMA cellular networks," IEEE Trans. Neural Netw., vol. 16, no. 5, pp. 1219-1228, Sep. 2005.
[8] Z. Ni, H. He, and J. Wen, "Adaptive learning in tracking control based on the dual critic network design," IEEE Trans. Neural Netw. Learn. Syst., vol. 24, no. 6, pp. 913-928, Jun. 2013.
[9] D. Nodland, H. Zargarzadeh, and S. Jagannathan, "Neural network-based optimal adaptive output feedback control of a helicopter UAV," IEEE Trans. Neural Netw. Learn. Syst., vol. 24, no. 7, pp. 1061-1073, Jul. 2013.
[10] D. Wang, D. Liu, D. Zhao, Y. Huang, and D. Zhang, "A neural-network-based iterative GDHP approach for solving a class of nonlinear optimal control problems with control constraints," Neural Comput. Appl., vol. 22, no. 2, pp. 219-227, Feb. 2013.
[11] H. Xu, S. Jagannathan, and F. L. Lewis, "Stochastic optimal control of unknown linear networked control system in the presence of random delays and packet losses," Automatica, vol. 48, no. 6, pp. 1017-1030, Jun. 2012.
[12] H. Xu and S. Jagannathan, "Stochastic optimal controller design for uncertain nonlinear networked control system via neuro dynamic programming," IEEE Trans. Neural Netw. Learn. Syst., vol. 24, no. 3, pp. 471-484, Mar. 2013.
[13] J. Liang, G. K. Venayagamoorthy, and R. G. Harley, "Wide-area measurement based dynamic stochastic optimal power flow control for smart grids with high variability and uncertainty," IEEE Trans. Smart Grid, vol. 3, no. 1, pp. 59-69, Mar. 2012.
[14] D. V. Prokhorov and D. C. Wunsch, "Adaptive critic designs," IEEE Trans. Neural Netw., vol. 8, no. 5, pp. 997-1007, Sep. 1997.
[15] X. Xu, Z. Hou, C. Lian, and H. He, "Online learning control using adaptive critic designs with sparse kernel machines," IEEE Trans. Neural Netw. Learn. Syst., vol. 24, no. 5, pp. 762-775, May 2013.
[16] D. Liu, D. Wang, D. Zhao, Q. Wei, and N. Jin, "Neural-network-based optimal control for a class of unknown discrete-time nonlinear systems using globalized dual heuristic programming," IEEE Trans. Autom. Sci. Eng., vol. 9, no. 3, pp. 628-634, Jul. 2012.
[17] J. J. Murray, C. J. Cox, G. G. Lendaris, and R. Saeks, "Adaptive dynamic programming," IEEE Trans. Syst., Man, Cybern., Part C, Appl. Rev., vol. 32, no. 2, pp. 140-153, May 2002.
[18] D. Wang, D. Liu, Q. Wei, D. Zhao, and N. Jin, "Optimal control of unknown nonaffine nonlinear discrete-time systems based on adaptive dynamic programming," Automatica, vol. 48, no. 8, pp. 1825-1832, Aug. 2012.
[19] Q. Wei and D. Liu, "An iterative $\epsilon$-optimal control scheme for a class of discrete-time nonlinear systems with unfixed initial state," Neural Netw., vol. 32, pp. 236-244, Aug. 2012.
[20] Y. Jiang and Z. P. Jiang, "Approximate dynamic programming for optimal stationary control with control-dependent noise," IEEE Trans. Neural Netw., vol. 22, no. 12, pp. 2392-2398, Dec. 2011.
[21] P. J. Werbos, "Approximate dynamic programming for real-time control and neural modeling," in Handbook of Intelligent Control: Neural, Fuzzy, and Adaptive Approaches, D. A. White and D. A. Sofge, Eds., New York, NY, USA: Van Nostrand Reinhold, 1992, ch. 13.
[22] R. Enns and J. Si, "Helicopter trimming and tracking control using direct neural dynamic programming," IEEE Trans. Neural Netw., vol. 14, no. 4, pp. 929-939, Aug. 2003.
[23] D. P. Bertsekas and J. N. Tsitsiklis, Neuro-Dynamic Programming. Belmont, MA, USA: Athena Scientific, 1996.
[24] J. Si and Y.-T. Wang, "On-line learning control by association and reinforcement," IEEE Trans. Neural Netw., vol. 12, no. 2, pp. 264-276, Mar. 2001.
[25] M. Geist and O. Pietquin, "Algorithmic survey of parametric value function approximation," IEEE Trans. Neural Netw. Learn. Syst., vol. 24, no. 6, pp. 845-867, Jun. 2013.
[26] K. S. Hwang and C. Y. Lo, "Policy improvement by a model-free Dyna architecture," IEEE Trans. Neural Netw. Learn. Syst., vol. 24, no. 5, pp. 776-788, May 2013.
[27] C. Watkins, "Learning from delayed rewards," Ph.D. dissertation, Dept. Comput. Sci., Cambridge Univ., Cambridge, U.K., 1989.
[28] T. Huang and D. Liu, "A self-learning scheme for residential energy system control and management," Neural Comput. Appl., vol. 22, no. 2, pp. 259-269, Feb. 2013.
[29] H. Li and D. Liu, "Optimal control for discrete-time affine nonlinear systems using general value iteration," IET Control Theory Appl., vol. 6, no. 18, pp. 2725-2736, Dec. 2012.
[30] D. Liu and Q. Wei, "Finite-approximation-error-based optimal control approach for discrete-time nonlinear systems," IEEE Trans. Cybern., vol. 43, no. 2, pp. 779-789, Apr. 2013.
[31] H. Modares, F. L. Lewis, and M. B. Naghibi-Sistani, "Adaptive optimal control of unknown constrained-input systems using policy iteration and neural networks," IEEE Trans. Neural Netw. Learn. Syst., Aug. 2013, to be published.
[32] F. Wang, N. Jin, D. Liu, and Q. Wei, "Adaptive dynamic programming for finite-horizon optimal control of discrete-time nonlinear systems with $\epsilon$-error bound," J. Control Theory Appl., vol. 22, no. 1, pp. 24-36, Jan. 2011.
[33] D. Wang, D. Liu, and Q. Wei, "Finite-horizon neuro-optimal tracking control for a class of discrete-time nonlinear systems using adaptive dynamic programming approach," Neurocomputing, vol. 78, no. 1, pp. 14-22, Feb. 2012.
[34] Q. Wei, H. Zhang, and J. Dai, "Model-free multiobjective approximate dynamic programming for discrete-time nonlinear systems with general performance index functions," Neurocomputing, vol. 72, nos. 7-9, pp. 1839-1848, 2009.
[35] H. Zhang, R. Song, Q. Wei, and T. Zhang, "Optimal tracking control for a class of nonlinear discrete-time systems with time delays based on heuristic dynamic programming," IEEE Trans. Neural Netw., vol. 22, no. 12, pp. 1851-1862, Dec. 2011.
[36] F. L. Lewis and D. Vrabie, "Reinforcement learning and adaptive dynamic programming for feedback control," IEEE Circuits Syst. Mag., vol. 9, no. 3, pp. 32-50, Jul. 2009.
[37] F. L. Lewis, D. Vrabie, and K. G. Vamvoudakis, "Reinforcement learning and feedback control: Using natural decision methods to design optimal adaptive controllers," IEEE Control Syst., vol. 32, no. 6, pp. 76-105, Dec. 2012.
[38] R. Beard, "Improving the closed-loop performance of nonlinear systems," Ph.D. dissertation, Rensselaer Polytechnic Inst., Troy, NY, USA, 1995.
[39] A. Al-Tamimi, F. L. Lewis, and M. Abu-Khalaf, "Discrete-time nonlinear HJB solution using approximate dynamic programming: Convergence proof," IEEE Trans. Syst., Man, Cybern., Part B, Cybern., vol. 38, no. 4, pp. 943-949, Aug. 2008.
[40] H. Zhang, Q. Wei, and Y. Luo, "A novel infinite-time optimal tracking control scheme for a class of discrete-time nonlinear systems via the greedy HDP iteration algorithm," IEEE Trans. Syst., Man, Cybern., Part B, Cybern., vol. 38, no. 4, pp. 937-942, Jul. 2008.
[41] D. Liu, H. Li, and D. Wang, "Neural-network-based zero-sum game for discrete-time nonlinear systems via iterative adaptive dynamic programming algorithm," Neurocomputing, vol. 110, pp. 92-100, Jun. 2013.
[42] D. Liu, D. Wang, and X. Yang, "An iterative adaptive dynamic programming algorithm for optimal control of unknown discrete-time nonlinear systems with constrained inputs," Inf. Sci., vol. 220, pp. 331-342, Jan. 2013.
[43] M. Abu-Khalaf and F. L. Lewis, "Nearly optimal control laws for nonlinear systems with saturating actuators using a neural network HJB approach," Automatica, vol. 41, no. 5, pp. 779-791, May 2005.
[44] H. Zhang, Q. Wei, and D. Liu, "An iterative adaptive dynamic programming method for solving a class of nonlinear zero-sum differential games," Automatica, vol. 47, no. 1, pp. 207-214, Jan. 2011.
[45] K. G. Vamvoudakis, F. L. Lewis, and G. R. Hudas, "Multi-agent differential graphical games: Online adaptive learning solution for synchronization with optimality," Automatica, vol. 48, no. 8, pp. 1598-1611, 2012.
[46] S. Bhasin, R. Kamalapurkar, M. Johnson, K. G. Vamvoudakis, F. L. Lewis, and W. E. Dixon, "A novel actor-critic-identifier architecture for approximate optimal control of uncertain nonlinear systems," Automatica, vol. 49, no. 1, pp. 82-92, Jan. 2013.
[47] H. Zhang, L. Cui, X. Zhang, and Y. Luo, "Data-driven robust approximate optimal tracking control for unknown general nonlinear systems using adaptive dynamic programming method," IEEE Trans. Neural Netw., vol. 22, no. 12, pp. 2226-2236, Dec. 2011.
[48] S. K. Gupta, Numerical Methods for Engineers. New Delhi, India: New Age Int. Company, 1995.
[49] A. Heydari and S. N. Balakrishnan, "Finite-horizon control-constrained nonlinear optimal control using single network adaptive critics," IEEE Trans. Neural Netw. Learn. Syst., vol. 24, no. 1, pp. 145-157, Jan. 2013.
[50] J. R. Wertz, Spacecraft Attitude Determination and Control. Amsterdam, The Netherlands: Kluwer, 1978, pp. 521-570.
Derong Liu (S'91-M'94-SM'96-F'05) received the Ph.D. degree in electrical engineering from the University of Notre Dame, South Bend, IN, USA, in 1994.

He was a Staff Fellow with the General Motors Research and Development Center, Warren, MI, USA, from 1993 to 1995. He was an Assistant Professor with the Department of Electrical and Computer Engineering, Stevens Institute of Technology, Hoboken, NJ, USA, from 1995 to 1999. He joined the University of Illinois at Chicago, Chicago, IL, USA, in 1999, and became a Full Professor of electrical and computer engineering and of computer science in 2006. He was selected for the 100 Talents Program by the Chinese Academy of Sciences in 2008. He has published 14 books (six research monographs and eight edited volumes).

Dr. Liu is currently the Editor-in-Chief of the IEEE Transactions on Neural Networks and Learning Systems. He received the Michael J. Birck Fellowship from the University of Notre Dame in 1990, the Harvey N. Davis Distinguished Teaching Award from Stevens Institute of Technology in 1997, the Faculty Early Career Development CAREER Award from the National Science Foundation in 1999, the University Scholar Award from the University of Illinois from 2006 to 2009, and the Overseas Outstanding Young Scholar Award from the National Natural Science Foundation of China in 2008. He is a fellow of the INNS.

Qinglai Wei (M'11) received the B.S. degree in automation, the M.S. degree in control theory and control engineering, and the Ph.D. degree in control theory and control engineering from Northeastern University, Shenyang, China, in 2002, 2005, and 2008, respectively.

He was a Post-Doctoral Fellow with the Institute of Automation, Chinese Academy of Sciences, Beijing, China, from 2009 to 2011. He is currently an Associate Professor with the State Key Laboratory of Management and Control for Complex Systems. His current research interests include neural-network-based control, adaptive dynamic programming, optimal control, nonlinear systems, and their industrial applications.
1. Manuscript received August 2, 2012; accepted August 27, 2013. Date of publication September 25, 2013; date of current version February 14, 2014. This work was supported in part by the National Natural Science Foundation of China under Grant 61034002, Grant 61233001, Grant 61273140, and Grant 61374105, in part by Beijing Natural Science Foundation under Grant 4132078, and in part by the Early Career Development Award of SKLMCCS. The acting Editor-in-Chief who handled the review of this paper was Ivo Bukovsky.

The authors are with the State Key Laboratory of Management and Control for Complex Systems, Institute of Automation, Chinese Academy of Sciences, Beijing 100190, China (email: derong.liu@ia.ac.cn; qinglai.wei@ia.ac.cn).

Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.

Digital Object Identifier 10.1109/TNNLS.2013.2281663