Research Article
Open Access

Dynamic Path Planning of Unknown Environment Based on Deep Reinforcement Learning

Xiaoyun Lei, Zhian Zhang (Corresponding Author), and Peifang Dong

Key Laboratory of Intelligent Ammunition Technology, School of Mechanical Engineering, Nanjing University of Science and Technology, Nanjing 210094, China
First published: 18 September 2018
Citations: 87

Academic Editor: Gordon R. Pennock

Abstract

Dynamic path planning in unknown environments has always been a challenge for mobile robots. In this paper, we apply the double deep Q-network (DDQN) deep reinforcement learning algorithm proposed by DeepMind in 2016 to dynamic path planning in unknown environments. A reward and punishment function and a training method are designed to address the instability of the training stage and the sparsity of the environment state space. In different training stages, we dynamically adjust the starting position and target position. As the neural network is updated and the greedy-rule probability increases, the local space searched by the agent expands. The Pygame module in Python is used to build dynamic environments. Taking the lidar signal and the local target position as inputs, convolutional neural networks (CNNs) are used to generalize the environmental state, and the Q-learning algorithm enhances the agent's dynamic obstacle avoidance and local planning abilities. The results show that, after training in different dynamic environments and testing in a new environment, the agent is able to reach the local target position successfully in an unknown dynamic environment.

1. Introduction

Since deep reinforcement learning [1] was first formally proposed in 2013, tremendous progress has been made in the field of artificial intelligence [2]. Deep Q-network agents were demonstrated to surpass the performance of all previous algorithms and to achieve a level comparable to that of a professional human games tester on Atari 2600 games [3–7]. AlphaGo Zero [8, 9] has defeated all previous AlphaGo versions through self-play, without using any human game records. An agent can be trained to play an FPS game receiving only pixels and the game score as inputs [10]. These examples demonstrate the great potential of reinforcement learning for autonomous decision-making once neural networks address the curse of dimensionality. In [11, 12], deep reinforcement learning has been applied to autonomous navigation based on visual inputs and has achieved remarkable success. In [11], Piotr Mirowski et al. highlighted the utility of un/self-supervised auxiliary objectives, namely, depth prediction and loop closure, in providing richer training signals that bootstrap learning and enhance data efficiency. The authors analyze agent behavior in static mazes featuring complex geometry, random start positions and orientations, and dynamic goal locations. Their results show that the approach enables the agent to navigate within large, visually rich environments with frequently changing start and goal locations, but the maze layout itself is static. The authors did not test the algorithm in environments with moving obstacles; with moving obstacles, the images collected by the cameras may differ in every episode. In [12], Yuke Zhu et al. try to find the minimum-length sequence of actions that moves an agent from its current location to a target specified by an RGB image. To address the lack of generalization, i.e., the need to retrain the network for new targets, they feed the task objective (the navigation destination) as an input to the model and introduce shared Siamese layers into the network. However, as mentioned in the paper, in the deep Siamese actor-critic architecture the ResNet-50 layers are pretrained on ImageNet and held fixed during training. This means that a large number of different target and scene images with constrained backgrounds must be collected for pretraining before the navigation model is trained, indicating that the generalization ability is still conditioned on prior knowledge of the maps.

In this paper, we present a novel path planning algorithm and solve the generalization problem by means of local path planning with the deep reinforcement learning algorithm DDQN, based on lidar sensor information. In recent deep reinforcement learning models, the original training mode results in a large number of samples in the pool being movement states in the free zone; the resulting lack of trial-and-error punishment samples and target reward samples ultimately prevents the algorithm from converging. We therefore constrain the starting position and target position by randomly setting the target position in the area not occupied by obstacles, which expands the state space distribution of the sample pool.

To evaluate our algorithm, we use TensorFlow to build the DDQN training framework for simulation and demonstrate the approach in the real world. In simulation, the agent is trained in low-level and intermediate dynamic environments. The starting point and target point are randomly generated to ensure the diversity and complexity of the local environment, and the test environment is a high-level dynamic map. We also show details of the agent's performance on an unseen dynamic map in the real world.

2. Deep Reinforcement Learning with DDQN Algorithm

The conventional Q-learning algorithm [1] cannot effectively plan a path in a random dynamic environment because of its lack of generalization ability and the size of the required Q table. To cope with the curse of dimensionality in a high-dimensional state space, the optimal action value function Q in Q-learning can be parameterized by an approximate value function:
$$Q^{*}(s,a) \approx Q(s,a;\theta) \tag{1}$$
where θ denotes the Q-network parameters. We approximate the value of Q for a given environment state with a function approximator, traditionally a linear one, so there is no need to build a large Q table holding the Q value of every state-action pair. A neural network is used in place of the linear function, yielding a nonlinear approximation with generalization ability. The pair (s, a) is the input of the neural network and the output is the value of Q, where θ is the weight of the Q-network; the Q table is thus replaced by the Q-network. Training consists of constantly adjusting the network weights to reduce the difference between the output of the Q-network and the target value of Q. Denoting the target value of Q by y, the loss function of the Q-network is
$$L_i(\theta_i) = \mathbb{E}\left[\left(y_i - Q(s,a;\theta_i)\right)^2\right] \tag{2}$$

where i is the current iteration index and (s, a) is a state-action pair.

The update mode of the value of Q is similar to that of Q-learning; that is,
$$y_i = r + \gamma \max_{a'} Q\!\left(s', a'; \theta_{i-1}\right) \tag{3}$$

where r is the current reward and γ is a discount factor.

The stochastic gradient descent algorithm is adopted to train the neural network. Through back propagation of the derivative of the loss function, the network weights are constantly adjusted so that the network output approaches the target value of Q. The network gradient can be derived from (2) and (3):
$$\nabla_{\theta_i} L_i(\theta_i) = \mathbb{E}\left[\left(r + \gamma \max_{a'} Q(s',a';\theta_{i-1}) - Q(s,a;\theta_i)\right)\nabla_{\theta_i} Q(s,a;\theta_i)\right] \tag{4}$$

Equation (4) indicates that the updates of the neural network and of Q-learning are simultaneous: the Q-network of the current iteration is updated with the target value of Q computed from the previous iteration.

Reinforcement learning is known to be unstable or even to diverge when a nonlinear function approximator such as a neural network is used to represent the action value (also known as Q) function [13]. This instability has several causes: the correlations present in the sequence of observations and the correlations between the action values (Q) and the target values. To address these instabilities, DeepMind presented a biologically inspired mechanism termed experience replay [14–16] that randomizes over the data. To perform experience replay, the agent's experience (environment state, action, and reward) at each time-step t is stored in a dataset. The pool of data samples is extended by running an ε-greedy policy to explore the environment. During learning, the network is trained with random samples drawn from the pool of stored samples, thereby removing correlations in the observation sequence and improving the stability of the algorithm.
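As a concrete illustration, a replay pool of this kind can be sketched in a few lines of Python (the capacity of 40000 and the minibatch size of 32 match the values used later in Section 5; the field layout is otherwise our own):

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-size FIFO pool of transitions for experience replay."""

    def __init__(self, capacity=40000):
        # deque with maxlen drops the oldest sample once the pool is full
        self.pool = deque(maxlen=capacity)

    def add(self, state, action, reward, next_state, done):
        self.pool.append((state, action, reward, next_state, done))

    def sample(self, batch_size=32):
        # Uniform random sampling breaks the temporal correlation
        # between consecutive observations.
        return random.sample(list(self.pool), batch_size)

    def __len__(self):
        return len(self.pool)
```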

In 2015, DeepMind improved the original algorithm in the paper "Human-Level Control through Deep Reinforcement Learning" published in Nature. They added a target Q-network, whose update is delayed relative to the predicted Q-network, to compute the target values of Q, thereby restraining the large bias in Q that results from fast updates of the target Q-network. The improved loss function is
$$L_i(\theta_i) = \mathbb{E}\left[\left(r + \gamma \max_{a'} Q(s',a';\theta_i^{-}) - Q(s,a;\theta_i)\right)^2\right] \tag{5}$$

where θi are the parameters of the Q-network at iteration i and θi⁻ are the network parameters used to compute the target at iteration i. The target network parameters θi⁻ are only updated with the Q-network parameters θi every C steps and are held fixed between individual updates, thus ensuring that the update of the target network is delayed and the estimate of Q is more accurate.

The update rule of the Q-learning algorithm leads to overestimation of action values. The algorithm estimates the value of certain states too optimistically, so the Q value of a suboptimal action may exceed that of the optimal action, changing the chosen action and degrading the accuracy of the algorithm. To address this overestimation, in 2016 DeepMind presented an improved algorithm in the paper "Deep Reinforcement Learning with Double Q-Learning", namely, the double Q-network (DDQN) algorithm [17]. It selects the action that maximizes the value of the predicted Q-network and evaluates that action with the target Q-network, instead of directly using the maximum value of the target Q-network:
$$y_i = r + \gamma\, Q\!\left(s', \arg\max_{a'} Q(s',a';\theta_i);\, \theta_i^{-}\right) \tag{6}$$
The loss function of the improved algorithm is
$$L_i(\theta_i) = \mathbb{E}\left[\left(y_i - Q(s,a;\theta_i)\right)^2\right] \tag{7}$$
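The difference between the target in (5) and the double-Q target in (6) can be made concrete with a short NumPy sketch (illustrative only; the arrays stand in for the outputs of the two networks, and terminal-state handling is omitted):

```python
import numpy as np

def dqn_target(reward, q_next_target, gamma=0.99):
    # Standard target as in (5): the target network both selects and evaluates.
    return reward + gamma * np.max(q_next_target, axis=1)

def ddqn_target(reward, q_next_online, q_next_target, gamma=0.99):
    # Double-Q target as in (6): the online (predicted) network selects the action,
    # the target network evaluates it, which reduces overestimation.
    best_actions = np.argmax(q_next_online, axis=1)
    return reward + gamma * q_next_target[np.arange(len(reward)), best_actions]

# Example: a batch of 2 transitions with 8 discrete actions.
rewards = np.array([-0.01, 1.0])
q_online = np.random.randn(2, 8)
q_target = np.random.randn(2, 8)
print(dqn_target(rewards, q_target))
print(ddqn_target(rewards, q_online, q_target))
```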

The framework illustration of DDQN is shown in Figure 1.

Figure 1: The framework illustration of DDQN.

3. Local Path Planning with DDQN Algorithm

In this paper, we achieve local path planning with the deep reinforcement learning algorithm DDQN. A lidar is used to detect environment information over 360 degrees, and its sensing range is regarded as the observation window. The accessible point nearest to the global path at the edge of the observation window is taken as the local target point for local path planning. The network receives the lidar dot matrix information and the local target point coordinates as inputs and outputs the direction of movement.

Considering the computational burden and the actual navigation effect, we set the angular resolution to 1 degree and the range limit to 2 meters, so each observation consists of 360 points indicating the distance to obstacles within a two-meter circle around the robot. The local target point is the intersection of the observation window's circular arc with the global path. If there are several intersection points, the optimal one is chosen by heuristic evaluation.
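For illustration, one simple way to pick such a local target is sketched below; this is our own simplified heuristic, assuming the global path is given as an ordered list of waypoints and taking the waypoint inside the 2 m window that lies farthest along the path:

```python
import math

def select_local_target(robot_xy, global_path, sensor_range=2.0):
    """Return the waypoint on the global path, within the lidar window,
    that is farthest along the path (a simple stand-in for the paper's
    heuristic evaluation of multiple intersection points)."""
    rx, ry = robot_xy
    candidate = None
    for wx, wy in global_path:                       # waypoints ordered start -> goal
        if math.hypot(wx - rx, wy - ry) <= sensor_range:
            candidate = (wx, wy)                     # keep overwriting: last hit is farthest along
    return candidate                                 # None if the path does not enter the window
```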

As for the lidar dot matrix information, each point is denoted by [angle, distance]; the angles of the lidar points increase clockwise relative to the front of the mobile robot, so 360 points yield 720 input values. The coordinate of the local target point relative to the current center of the mobile robot body, [Δx, Δy], is also an input. To let the network weight the relative target point adequately during training and to simplify the design of the convolutional network, we replicate the relative target point 40 times; the total input therefore consists of 800 values.
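A sketch of how the 800-element input can be assembled (the variable names and the exact packing order into the 20×20×2 tensor used later in Section 4 are our assumptions):

```python
import numpy as np

def build_network_input(lidar_scan, local_target_offset):
    """lidar_scan: 360 (angle_deg, distance_m) pairs, angles increasing clockwise
    from the robot's heading. local_target_offset: (dx, dy) of the local target
    relative to the robot body center."""
    lidar = np.asarray(lidar_scan, dtype=np.float32).reshape(-1)              # 360 x 2 -> 720 values
    target = np.tile(np.asarray(local_target_offset, dtype=np.float32), 40)   # 40 copies -> 80 values
    x = np.concatenate([lidar, target])                                       # 800 values in total
    return x.reshape(20, 20, 2)                                               # reshaped for the CNN input layer
```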

The outputs are fixed-length omnidirectional movements in eight directions, where the movement length is 10 cm and the directions are forward, backward, left, right, left front, left rear, right front, and right rear, denoted by 1 to 8 in order as follows.
$$a \in \{1, 2, 3, 4, 5, 6, 7, 8\} \tag{8}$$
The design of the reward function mainly accounts for obstacle avoidance and approaching the target point; the shortest movement path satisfying these two conditions is the most effective. The reward and punishment function is designed as
$$r = \begin{cases} -1, & p \in o \\ +1, & p = g \\ -0.01, & \text{otherwise} \end{cases} \tag{9}$$

where p is the current position, g is the local target point, and o represents an obstacle after expansion processing. Equation (9) indicates that the reward is -1 if the agent encounters an obstacle, +1 if it reaches the target point, and -0.01 for every other movement step. During training, in order to continuously increase the cumulative reward, the agent gradually learns, by trial and error, to avoid obstacles and reach the target point along a short path.
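Expressed in code, the action set in (8) and the reward in (9) might look as follows (a sketch; the sign convention of the displacements and the collision/goal checks are our own assumptions):

```python
# Action indices 1..8 mapped to fixed 10 cm displacements (dx, dy) in the robot frame
# (signs follow our own coordinate convention, not specified in the paper).
STEP = 0.10  # metres
ACTIONS = {
    1: (STEP, 0.0),    2: (-STEP, 0.0),   3: (0.0, STEP),   4: (0.0, -STEP),    # forward, backward, left, right
    5: (STEP, STEP),   6: (-STEP, STEP),  7: (STEP, -STEP), 8: (-STEP, -STEP),  # left front, left rear, right front, right rear
}

def reward(hit_obstacle, reached_goal):
    """Reward function of (9): -1 on collision, +1 at the goal, -0.01 per step otherwise."""
    if hit_obstacle:
        return -1.0
    if reached_goal:
        return 1.0
    return -0.01
```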

Each lidar point takes continuous measurement values from 0 cm to 200 cm, so the state space is effectively infinite and it is impossible to represent all states with a Q table. With DDQN, the generalization ability of the neural network allows it to cover all states after training; therefore, when the environment changes, the agent can plan a proper path and reach the target position using the learned network weights.

To ensure that deep reinforcement learning training converges normally, the pool should be large enough to store the state-action pair of each time-step and keep the training samples of the neural network approximately independent and identically distributed; in addition, the environment punishments and rewards should reach a certain proportion. If the sample space is too sparse, i.e., most stored states are random movements in free space, it is difficult to achieve a stable training effect. To cope with the instability of DDQN during training and the reward sparsity of the state space, the starting point is randomly set within a circle of radius L centered on the target point. The initial value of L is small, which increases the probability that the agent reaches the target point from the starting point during random exploration and ensures positive incentives in the sample space. The value of L gradually increases with the update of the neural network and the increase of the greedy rule probability, and the local space searched by the agent expands as follows.
$$L = \begin{cases} L_{\min}, & n < N_1 \\ L_{\min} + m\,(n - N_1), & N_1 \le n < N_2 \\ L_{\max}, & n \ge N_2 \end{cases} \tag{10}$$

where n is the current iteration time-step; Lmin is the initial value of L; Lmax is the maximum value of L; m is the search speed in space; N1 and N2 are thresholds of iteration times that need to be adjusted according to the training parameters.
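A sketch of this schedule, under our reconstruction of the piecewise form in (10) and with illustrative numeric values:

```python
def start_radius(n, l_min=0.5, l_max=5.0, m=0.001, n1=5000, n2=30000):
    """Radius L of the circle (centred on the goal) in which the start is sampled.
    n is the current iteration time-step; all numeric defaults are illustrative."""
    if n < n1:
        return l_min                                # early training: start close to the goal
    if n < n2:
        return min(l_min + m * (n - n1), l_max)     # grow the search space at rate m
    return l_max                                    # late training: full search radius
```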

Each episode terminates after a fixed number of moving steps, rather than terminating immediately when the agent hits an obstacle or reaches the target point. The original termination mode results in a large number of pool samples being movement states in the free zone, and the lack of trial-and-error punishment samples and target reward samples ultimately prevents the algorithm from converging.

4. The Neural Network in DDQN

A neural network is the primary means by which reinforcement learning gains generalization ability. Optimizing the network architecture reduces overfitting and improves prediction accuracy while keeping training fast and computational cost low.

With 800 inputs, directly adopting fully connected layers would cause the number of trainable parameters to grow rapidly with the number of layers, resulting in a heavy computational burden and overfitting. Convolutional neural networks (CNNs) have made breakthrough progress in image recognition in recent years. CNNs feature a distinctive processing mode characterized by local connections, pooling, and weight sharing, which effectively reduces the computational cost, the computational complexity, and the number of trained parameters. With CNNs, the learned representation is invariant to translation, scaling, distortion, and other transformations to a certain degree, thereby improving the robustness of the system. Owing to these features, CNNs surpass fully connected neural networks in information processing tasks where the data are spatially correlated [18].

The framework of CNNs is shown in Figure 2. In general, a CNN consists of an input layer, convolutional layers, pooling layers, fully connected layers, and an output layer. A convolutional layer uses local connections; each node is connected to only some nodes of the previous layer. For example, four adjacent pixels of the input layer form a 2-by-2 local receptive field, which is connected to one neuron in the convolutional layer. Each local receptive field represents a local feature, and the size of the field can be adjusted.

Figure 2: The architecture of CNNs.

The layer following a convolutional layer is a pooling layer, which also reduces the matrix size. A spatially invariant feature is obtained by reducing the resolution of the feature maps [19]. The number of feature maps in the pooling layer equals that of the preceding convolutional layer, and each neuron performs a pooling operation on its local receptive field. Pooling is suitable for separating sparse features.

A fully connected layer is added after one or more convolutional and pooling layers. The fully connected layer integrates the category-discriminative information extracted by the convolutional and pooling layers [20]. In addition, the rectified linear unit (ReLU) is a commonly used activation function in CNNs that can improve performance. The output of the fully connected layer is passed directly to the output layer; for deep reinforcement learning, the output layer is the action space.

The lidar data of adjacent points reflect the distribution of obstacles and the width of the free zone. A convolutional network can effectively extract these characteristics while greatly reducing the number of network parameters. Therefore, CNNs are used to train the path planning for the local dynamic environment.

In this paper, the Q-network consists of three convolutional layers and a fully connected layer. The input layer is a 20×20×2 three-dimensional matrix formed from a vector with 800 elements, where the third dimension holds the angle and distance of a lidar point. According to the size of the input layer, the first convolutional layer has a 2-by-2 receptive field, a 2-by-2 stride, and 16 feature maps, so its output is 10×10×16. The second convolutional layer has a 2-by-2 kernel, a 2-by-2 stride, and 32 feature maps, giving an output of 5×5×32. The third convolutional layer has a 5-by-5 kernel, a 1-by-1 stride, and 128 feature maps, giving an output of 1×1×128. This three-dimensional structure is then flattened into a one-dimensional vector with 128 elements, which is connected to a 128-by-256 fully connected layer. The size of the output layer is the number of actions, namely 8. We use the ReLU activation function and the Adam optimizer.
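Written against the tf.keras API (rather than the TensorFlow code the authors would have used in 2018), the stated layer sizes give the following sketch; padding, weight initialization, and the learning rate are our assumptions:

```python
import tensorflow as tf

def build_q_network(num_actions=8, learning_rate=1e-4):
    """Q-network of Section 4: three convolutional layers plus a fully connected layer."""
    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(20, 20, 2)),                                   # 800 inputs as a 20x20x2 tensor
        tf.keras.layers.Conv2D(16, kernel_size=2, strides=2, activation="relu"),    # -> 10x10x16
        tf.keras.layers.Conv2D(32, kernel_size=2, strides=2, activation="relu"),    # -> 5x5x32
        tf.keras.layers.Conv2D(128, kernel_size=5, strides=1, activation="relu"),   # -> 1x1x128
        tf.keras.layers.Flatten(),                                                  # -> 128
        tf.keras.layers.Dense(256, activation="relu"),                              # fully connected 128 -> 256
        tf.keras.layers.Dense(num_actions, activation="linear"),                    # one Q value per action
    ])
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate), loss="mse")
    return model
```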

5. Local Path Planning Simulation with Pygame Module

We use the open source machine learning framework TensorFlow to build the DDQN training framework, and the Pygame module to build the dynamic environments. As shown in Figure 3, the white areas denote obstacles, each 20-by-20 cm, moving at constant speed, and the corresponding blue arrows represent their moving ranges and directions. The green point denotes a local target position, and the gray square represents the mobile robot (agent). During training, local path planning is trained in the low-level and intermediate environments. The starting point and target point are randomly generated to ensure the diversity and complexity of the local environment, and the test environment is a high-level dynamic map. Two CNNs of the same architecture are established: the Q estimation network and the Q target network. The network parameters are randomly initialized from a normal distribution with zero mean.

Figure 3: Dynamic environments. (a) Low-level environment; (b) intermediate environment; (c) high-level environment.

The training policy is a variable random greedy rule. At the beginning of training, the pool of experience is filled by random exploration because of the lack of environmental information. Each sample consists of five components, [s, a, r, s1, d], where s denotes the current state, a the adopted action, s1 the state after taking the action, r the reward obtained for the current state transition, and d a flag indicating whether the current iteration epoch has ended. We set the size of the pool to 40000 samples. Once the samples stored in the pool reach a certain number, the network is trained with samples randomly selected from the pool. In the first 5000 time-steps of random exploration, the network parameters are not updated, but samples are added to the pool. After the sample count reaches 5000, the network is trained every four time-steps. We randomly draw minibatches of 32 samples to update the Q estimation network and update the Q target network with a low learning rate, so that the parameters of the Q target network slowly approach those of the Q estimation network, ensuring stable learning of the Q target network. When the pool reaches its upper limit, the earliest sample is removed in a first-in first-out manner whenever a new sample is added, ensuring continuous updating of the pool.
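Putting these pieces together, the training schedule described above (random exploration for the first 5000 steps, a gradient update every four time-steps on minibatches of 32, and a slow soft update of the target network) might look like the following sketch, reusing the ReplayBuffer and build_q_network sketches above; the environment interface, the ε schedule, and the soft-update rate τ are assumptions, not taken from the paper:

```python
import numpy as np

def train(env, q_online, q_target, buffer, total_steps=200_000,
          warmup=5000, train_every=4, batch_size=32, gamma=0.99, tau=0.01):
    state = env.reset()                                    # assumed env API: reset() -> state
    for step in range(total_steps):
        epsilon = max(0.1, 1.0 - step / 100_000)           # assumed epsilon-greedy schedule
        if step < warmup or np.random.rand() < epsilon:
            action = np.random.randint(8)                  # random exploration (actions indexed 0..7 here)
        else:
            action = int(np.argmax(q_online.predict(state[None], verbose=0)[0]))

        next_state, reward, done = env.step(action)        # assumed env API: step() -> (state, reward, done)
        buffer.add(state, action, reward, next_state, done)
        state = env.reset() if done else next_state

        if step >= warmup and step % train_every == 0:
            batch = buffer.sample(batch_size)
            s, a, r, s1, d = map(np.array, zip(*batch))
            # Double-Q target: action chosen by the online net, evaluated by the target net.
            best = np.argmax(q_online.predict(s1, verbose=0), axis=1)
            q_next = q_target.predict(s1, verbose=0)[np.arange(batch_size), best]
            y = q_online.predict(s, verbose=0)
            y[np.arange(batch_size), a] = r + gamma * q_next * (1.0 - d)
            q_online.train_on_batch(s, y)
            # Soft update keeps the target network trailing the online network.
            for wt, wo in zip(q_target.weights, q_online.weights):
                wt.assign(tau * wo + (1.0 - tau) * wt)
    return q_online
```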

According to the actual experimental results, if the positions of the agent and the target point were set completely at random from the start, there would be a large probability that the distance between them was too large, so the agent would be unable to reach the target point within the fixed number of steps by random exploration. Consequently, to ensure that the agent can detect the target point, we set the initial distance between the agent and the target point randomly within a range L and gradually increase L as exploration and training proceed. This process also constantly changes the environmental states, which leads to a more extensive distribution of samples.

During the training process, the loss function values of the Q estimation network and the Q target network decrease continuously. Figure 4 shows the training curves of the loss function of the Q target network; each point is the average loss over ten epochs. Figure 4(1) indicates that, in the initial stage of training, the loss is on the order of 10^7; after training for 1000 epochs, the value drops to about 10,000 (in the figure the curve approaches the x-axis), and after 7000 epochs, as shown in Figure 4(2), it drops below 1. By the end of the whole process, the loss is less than 0.25, and the Q-network converges successfully. The cumulative reward of each iteration gradually increases; that is, in the process of exploration, the agent gradually learns to take actions that increase the reward. Figure 5 shows the average cumulative reward curves; each point is the average cumulative reward over one hundred epochs. The figure indicates that the average reward gradually increases with the number of training epochs. After training for 30000 epochs, the reward becomes stable and the average cumulative reward exceeds 80. We set the fixed number of steps to 100, which means the agent can avoid obstacles in a dynamic environment and quickly reach the target point to obtain a continuous reward until the current training epoch ends.

Figure 4: Training curves of the loss function of the Q target network. Each point is the average loss function value over ten epochs. The y-axis denotes the value of the loss function and the x-axis denotes the iteration epoch.
Figure 5: The average cumulative reward curves. Each point is the average cumulative reward over one hundred episodes. The y-axis denotes the average cumulative reward and the x-axis denotes the iteration epoch.

After training for 40000 epochs, the Q estimation network and the Q target network converge. We store the network parameters and test on an experimental environment map. The test is designed as follows: assuming a global path, we set several local target points in the lidar search area, regardless of the positions of the dynamic and static obstacles, to test the agent's local path planning ability. Figure 6 shows a new environment map that was never used in training; it is used to evaluate the generalization ability and path planning ability of DDQN in a new, unknown dynamic environment. There are three freely moving obstacles with constant speed, whose moving directions are denoted by the blue arrows. In the figure, state A shows that the local path is blocked by dynamic obstacle No. 2; the agent then waits for the obstacle to move downwards. When the obstacle leaves the path, the agent moves towards the upper right. As shown in state B, the agent successfully reaches the third local target. In state C, obstacle No. 3 moves towards the agent; the agent perceives obstacle No. 3 before a collision and gets to the sixth local target point by moving towards the bottom right. In state D, the agent reaches the end point and completes the path planning without any accidental collision with the obstacles. The whole process demonstrates that the agent trained by DDQN is able to perceive moving obstacles in advance from the lidar data in an unknown dynamic environment. The Q target network yields an intelligent planning strategy in which each step moves towards a higher cumulative reward.

Figure 6: Local path planning in a test map.

6. Test in Actual Environment with ROS Framework

In order to verify the algorithm in a real environment, we use an omnidirectional mobile robot based on Mecanum wheels and achieve autonomous navigation with the ROS framework [21–23]. The move_base package in ROS provides local path planning algorithms for the user, namely the dynamic window approach and trajectory rollout. In this paper, the DDQN local path planning algorithm is transplanted into ROS as a plug-in, encapsulated as a subclass of the nav_core base_local_planner interface, namely DDQN_local_planner. The outputs of the deep reinforcement learning algorithm are the translational movements of the agent in eight directions. These movements are mapped to actions of turning and moving forward: the agent first rotates in place to the corresponding direction and then moves forward 10 cm at a preset speed.
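A hedged sketch of how such a discrete action could be turned into robot motion with rospy is shown below; the topic name, speeds, heading convention, and open-loop timing are our assumptions rather than the authors' plugin code:

```python
import math
import rospy
from geometry_msgs.msg import Twist

# Headings (radians, robot frame) for the eight actions, indexed 1..8 as in (8);
# the left/right sign convention is our own assumption.
HEADINGS = {1: 0.0, 2: math.pi, 3: math.pi / 2, 4: -math.pi / 2,
            5: math.pi / 4, 6: 3 * math.pi / 4, 7: -math.pi / 4, 8: -3 * math.pi / 4}

def execute_action(action, cmd_pub, ang_speed=0.5, lin_speed=0.1, step=0.10):
    """Rotate in place towards the commanded direction, then move forward 10 cm."""
    rate = rospy.Rate(20)
    heading = HEADINGS[action]
    turn = Twist()
    turn.angular.z = ang_speed if heading >= 0 else -ang_speed
    for _ in range(int(abs(heading) / ang_speed * 20)):   # open-loop timing, no odometry feedback
        cmd_pub.publish(turn)
        rate.sleep()
    forward = Twist()
    forward.linear.x = lin_speed
    for _ in range(int(step / lin_speed * 20)):           # 10 cm at 0.1 m/s -> 1 s
        cmd_pub.publish(forward)
        rate.sleep()
    cmd_pub.publish(Twist())                              # stop

# Usage (assumed topic name):
# rospy.init_node("ddqn_local_planner_demo")
# pub = rospy.Publisher("/cmd_vel", Twist, queue_size=1)
# execute_action(5, pub)
```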

Path planning in the unknown environment means navigating without a prior SLAM map of the environment. Figure 7 is a local map; only local information is detected by the lidar in real time, and the gray area is unknown. The agent avoids obstacles with local path planning. The starting position is the current position of the agent, and the target point is set in the unknown area. The unknown area between the starting point and the target point is considered traversable. The schematic of path planning is shown in Figure 8.

Figure 7: A local environment map.
Figure 8: The schematic of global path planning in unknown environment.

The actual layout of the unknown environment is shown in Figure 9. In the figure, the red arrows denote the preplanned global path, which is blocked by obstacle No. 2; the agent therefore needs to adjust its path to reach the target position without hitting the obstacles.

Figure 9: The actual layout of the unknown environment.

The local path planning is performed by the trained DDQN algorithm. The inputs are the angles and distances detected by the lidar and the local planning points; the output is the moving direction, and the resulting trajectory is smoothed. During the movement, the agent builds the map and navigates in real time. The local path planning performance of the algorithm and the generalization ability of the convolutional network are tested in the unknown environment. Figure 10 shows the final path: the agent followed the global path but modified it and successfully avoided the obstacles it encountered. A new environment map was gradually built in the process of moving.

Figure 10: Path planning in unknown environment.

The movement of the agent in the actual environment is shown in Figure 11. When the agent was about 30 cm in front of the obstacle, it changed direction and avoided the obstacle on the left, then approached the global path again and finally arrived at the target point. The test verifies the feasibility and efficiency of the algorithm in an unknown environment.

Figure 11: The movements of the agent in actual environment.

7. Conclusions

Deep reinforcement learning handles the curse of dimensionality well and is able to process multidimensional inputs. In this paper, we design a specific training method and reward function that effectively address the divergence of the algorithm caused by the reward sparsity of the state space. The experimental results show that the DDQN algorithm is effective and flexible for local path planning in an unknown dynamic environment using only lidar data. Compared with visual navigation, local lidar information can be transferred to more complex environments without retraining the network parameters, giving the approach wider applicability and better generalization performance.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.

Acknowledgments

This work is supported by National Natural Science Foundation (NNSF) of China under Grant 11472008.
