Improved DDPG algorithm-based path planning for unmanned surface vehicles
Abstract
As a promising mode of water transportation, unmanned surface vehicles (USVs) are used in various fields owing to their small size, high flexibility, favorable price, and other advantages. Traditional navigation algorithms are affected by various path planning issues. To address the limitations of the traditional deep deterministic policy gradient (DDPG) algorithm, namely its slow convergence speed and sparse reward and punishment functions, we propose an improved DDPG algorithm for USV path planning. First, the principle and workflow of the DDPG deep reinforcement learning (DRL) algorithm are described. Second, the improved method, based on the USV kinematic model, is proposed, and a continuous state and action space is designed. The reward and punishment functions are improved, and the principle of collision avoidance at sea is introduced. Dynamic region restriction is added: distant obstacles are ignored in the state space and only nearby obstacles are observed, which reduces the number of algorithm iterations and saves computational resources. The introduction of a multi-intelligence approach combined with a prioritized experience replay mechanism accelerates algorithm convergence, thereby increasing the efficiency and robustness of training. Finally, through a combination of theory and simulation, the DDPG DRL algorithm is explored for USV obstacle avoidance and optimal path planning.
1. INTRODUCTION
An unmanned surface vehicle (USV) is a vessel that can autonomously navigate on the surface of water and perform various tasks. Research on USVs is of great significance for marine development and monitoring because it can effectively improve efficiency, reduce risks and costs, and support the rational utilization and protection of marine resources. As a type of water transportation with broad application prospects, unmanned vessels have attracted considerable attention from researchers and industries, and they are used in various professional domains[1]. These vehicles are capable of performing a wide range of tasks in the marine environment, such as patrol security, anti-submarine warfare, marine survey, underwater detection, and rescue operations[2]. USVs can aid in understanding the marine environment and allocating resources, as they can perform tasks at sea for long periods without being limited by factors such as personnel fatigue and safety. They can also replace personnel in rescue operations, reducing the risk of casualties, and they are advantageous in terms of operation and maintenance costs: because no crew is required, training costs are also reduced. USVs are therefore favored for their efficiency, safety, and low cost. Research on path planning helps USVs avoid obstacles, optimize navigation routes, improve efficiency, and ensure vessel safety[3]. In dynamic and complex water environments, USVs must travel to their destinations autonomously, which imposes rigorous requirements on route planning and on route adjustment according to the actual situation[4]. Traditional route planning methods have limitations and cannot adapt to real-time changes and uncertainties in complex environments. Therefore, how to use reinforcement learning algorithms to realize autonomous obstacle-avoidance path planning has become a topic of focus in current research. For example, Su et al. proposed multi-intelligence reinforcement learning and applied it to unmanned ground vehicles to capture escaping targets[5]. They utilized the soft actor-critic (SAC) algorithm, introduced an attention mechanism to prevent collisions of the capturing vehicles, and proposed a reward function strategy that combines individual and collaborative rewards. In this context, the capturing vehicle was treated as the obstacle, the escaping vehicle as the destination, and path planning was performed to reach the goal point. Similarly, Feng et al. applied reinforcement learning to the UAV navigation problem[6], modeling the navigation task as a Markov decision process (MDP) with parameterized actions and taking the sparse reward problem into account; to address these issues, they developed HER-MPDQN by combining the multi-pass deep Q-network (MP-DQN) with hindsight experience replay (HER).
Many traditional path planning algorithms have been proposed, such as the A* algorithm for global path planning, which can find solutions in a limited map space. Yu et al. combined the velocity obstacle method with the A* algorithm to propose a new anti-collision path planning method that, unlike the traditional A* algorithm, limits the risk of being trapped in a local minimum[7]. Zhang et al. proposed increasing the cost of turning points to reduce their number and thereby smooth the path[8]. These approaches represent significant improvements over the traditional A* algorithm, but their memory consumption is excessive for large-scale problems, and the search must be repeated whenever the obstacles change. Yu et al. proposed an improved D* Lite algorithm with an enhanced path cost function to reduce the range of expanded nodes and shorten the path length by using inverse distance weighted interpolation[9]. Jin et al. combined conflict-based search with D* Lite for path planning in the presence of map changes[10]. The approaches proposed in[9,10] are significant improvements over the traditional D* Lite algorithm; however, D* Lite is based on discrete grids, so its handling of continuous space still needs to be optimized, and it tends to fall into local optima when a large number of obstacles is encountered. The dynamic window approach (DWA), in contrast, is a local path planning algorithm. Zhang et al. proposed a vector field adaptive DWA to clarify the obstacle information in all directions[11]. Unlike the traditional DWA algorithm, this method improves the evaluation function, solves the route detour problem of DWA, and uses the maximum acceleration to limit the dynamic performance. Cao et al. designed velocity transformation and safe distance evaluation coefficients by considering the velocity and direction of movement between a robot and an obstacle[12]; their modified DWA algorithm is computationally less intensive and offers real-time capability. However, in more complex environments, DWA falls into local minima, and local path planning methods are restricted in other environments. In summary, traditional global and local path planning algorithms lack generalization ability in new environments, tend to fall into local optima, and consume excessive memory. In addition, their ability to handle continuous space is insufficient.
Since it was proposed, reinforcement learning has been studied widely by scholars, and it is a current research hotspot in path planning. The algorithms proposed thus far include Q-learning, the deep Q-network (DQN), and the deep deterministic policy gradient (DDPG). Zhou et al. introduced a Q-table initialization method, incorporated a new action-selection policy and a new reward function, and adapted the root mean square propagation (RMSprop) method for learning rate adjustment[13]. Wang et al. combined a radial basis function (RBF) neural network with the Q-learning algorithm for function approximation to improve the convergence speed of the algorithm[14]. In addition, they redesigned the reward function to take into account the USV's heading angle and turning performance and proposed a safety threshold. The improved Q-learning algorithm converged faster and planned better paths, but its computational cost was higher for high-dimensional state spaces; moreover, when function approximation is used, it may face problems such as overfitting and gradient explosion. Yu et al. defined the state and action spaces in a double DQN architecture and designed a reward function to evaluate and guide model training[15]. The DQN algorithm is an extension of Q-learning that uses deep neural networks to approximate the Q-function and handle complex, high-dimensional state spaces. Although these approaches largely improved the traditional DQN algorithm, they are restricted to discrete action spaces, require increased computational effort for continuous actions, and are more sensitive to hyperparameters. Therefore, we adopt the DDPG algorithm, which can deal with high-dimensional state spaces and continuous action spaces, instead of the reinforcement learning algorithms described above. The DDPG algorithm is stable, and the inclusion of two additional target networks helps reduce jitter during training and improves stability. Liu et al. and Zhou et al. proposed multi-intelligence reinforcement learning path planning, and their results indicated that multiple intelligences can improve system stability in complex environments to a greater extent than a single intelligence[16,17]; this shortens the training time and increases computational efficiency. Liu et al. proposed multi-intelligence and combined prioritized experience replay mechanisms to solve the problem of inefficient experience reuse[16]. However, when a large number of intelligences is used, the coordination and cooperation between them must be considered, which increases the computational burden; moreover, a large amount of space is required to store the experiences of these intelligences in large-scale environments. Zhou et al. proposed the novel task-decomposed multi-agent twin delayed deep deterministic policy gradient (TD-MATD3) algorithm and redesigned the reward function, achieving good results in complex dynamic environments; however, it still did not completely solve the heavy computational burden of multiple intelligent agents, nor did it involve the International Regulations for Preventing Collisions at Sea (COLREGs), introduce the corresponding constraints, or conform to the actual navigation situation[17]. Wu et al.
proposed decoupling the original sparse reward and combining it with an artificial potential field method to design new reward and punishment functions, thereby alleviating reward sparsity and improving algorithm efficiency[18]. However, the problem of sensitive parameter settings remained unsolved, and in complex environments the artificial potential field may not be adjusted efficiently, is extremely time-consuming, and has a high probability of falling into a local optimum. Chen et al. attempted to reduce Q-value overestimation and enhance the ability of an agent to explore the global optimum[19]. A warm-up stage was introduced to improve stability at the beginning of training and accelerate convergence. However, excessive reliance on the warm-up phase, which requires substantial tuning to find the right settings in more complex environments, can increase the training time; moreover, reducing Q-value overestimation can lead to local optimality in complex environments. Feng et al. introduced intrinsic motivation and added HER to the DDPG algorithm to improve its exploration ability[20]. They used the beluga whale optimization algorithm for hyperparameter optimization to improve the training process. However, the combination of intrinsic motivation and HER increases the computational burden, which can lead to over-exploration, and HER needs to select appropriate replay targets, failing which the learning rate can decrease. Yang et al. proposed combining deep reinforcement learning (DRL) with the velocity obstacle (VO) method, introduced the COLREGs principle, and incorporated the VO algorithm into the reward function to avoid collisions based on the sailing speed[21]. In addition, they addressed the problem of sparse rewards by using fuzzy computation and introduced a navigational influence factor (NIF) to construct the state space. Although the introduction of the velocity obstacle method is more consistent with the actual navigation situation, applying fuzzy theory may introduce uncertainty into the decision-making process, and the influence factor increases the complexity of the state space and decreases the processing speed. Although the abovementioned studies largely improved the DDPG algorithm, a few shortcomings persist. For example, the reward design is overly complicated; a warm-up phase has been introduced, but a balance between exploration and exploitation is yet to be achieved; and coordination among multiple intelligences remains to be realized. In addition, the improved algorithms converge faster but are more complex, which leads to other problems.
Drawing on the above literature, this paper integrates these different ideas. The proposed DDPG reinforcement learning algorithm improves the combined reward function, sets a dynamic region restriction, uses dual intelligences combined with a prioritized experience replay mechanism, designs the action space by considering the actual navigation situation of USVs, and incorporates the COLREGs principle.
The contributions of this paper can be summarized as follows:
1. A DDPG reinforcement learning framework based on the actor-critic structure is proposed. This framework uses the properties of the DDPG algorithm and is applied to a continuous action space. Considering the problem of reward sparsity, a new continuous reward function is designed to improve the motivation of the intelligent agent, such that there is a corresponding reward or punishment for each step, which improves the convergence speed of the algorithm.
2. To save computational resources, we employ the dynamic region restriction method, which is designed to let an intelligent agent ignore faraway obstacles and retain only a small number of obstacles around itself to reduce the computational burden.
3. Dual intelligences are combined with the prioritized experience replay mechanism to allow the intelligences to share their experiences with each other, which greatly reduces the training time, increases the overall planning efficiency, and improves the learning speed. Dual intelligences help avoid the scenario in which a large number of intelligences increases the computational burden, and yet, they are more efficient than a single intelligence.
4. To ensure greater consistency with the actual USV navigation situation, we set up the action space to prevent the USV from making unrealistically large turns, such as 180-degree turns; in addition, the COLREGs principle is introduced so that the planned path stays away from obstacles, which improves safety and optimizes the sailing route.
The remainder of this paper is organized as follows: Section 2 describes the working principle of the DDPG reinforcement learning algorithm, and Section 3 describes how we improve the DDPG algorithm, with each subsection explaining an individual improvement. In Section 4, we describe experimental simulations conducted using maps and obstacles of different sizes to demonstrate the feasibility of the algorithm and compare the proposed approach with the traditional DDPG algorithm, the deep Q-network (DQN), and combined prioritized experience replay with multi-intelligent deep deterministic policy gradient (CPER-MADDPG). In the final section, conclusions and the future outlook are summarized.
2. DDPG DRL ALGORITHMS
The DDPG algorithm is a reinforcement learning method whose interaction with the environment can be represented as a Markov decision process. It mainly involves an intelligent agent, an environment, states, rewards, and punishments. The intelligent agent observes a state in the current environment, performs the next action based on this state, and obtains a reward or punishment value. The agent interacts continuously with the environment to obtain as many rewards as possible. The basic framework is illustrated in Figure 1.
The DDPG algorithm has prominent advantages over other reinforcement learning algorithms. It can handle complex inputs such as high-dimensional state spaces and can even use images as inputs through neural network function approximation. It is more capable of handling continuous action space problems because it uses a deterministic policy that outputs continuous actions, which helps improve convergence in continuous action spaces. The DDPG algorithm solves the continuous action problem by drawing mainly on two methods from DQNs, namely experience replay and target networks. Moreover, the DDPG algorithm adds a policy network on top of the DQN to output actions directly, which distinguishes it from other reinforcement learning algorithms. The policy network and the Q-network together form an actor-critic (AC) structure: the policy network acts as the actor, the Q-network scores each action of the actor, and the policy network continuously adjusts its policy according to the scores of the Q-network; that is, it updates the parameters of its neural network.
The optimal strategy for a DQN is to learn a Q-network that is good enough to pick the action that maximizes the Q-value. In DDPG, the actor caters to the critic, and therefore the gradient of the policy network aims to maximize the Q-value; equivalently, the policy loss is the negative of the Q-value, as given in

$$ L(\theta^{\mu}) = -\frac{1}{N}\sum_{i} Q\big(s_i, \mu(s_i \mid \theta^{\mu}) \mid \theta^{Q}\big), $$

where $\theta^{\mu}$ and $\theta^{Q}$ denote the parameters of the policy network and the Q-network, respectively, $s_i$ is a sampled state, $\mu(\cdot)$ is the deterministic policy, and $N$ is the mini-batch size. The Q-network itself is updated by minimizing the mean squared error between its prediction and the target value computed with the two target networks:

$$ L(\theta^{Q}) = \frac{1}{N}\sum_{i}\big(y_i - Q(s_i, a_i \mid \theta^{Q})\big)^{2}, \qquad y_i = r_i + \gamma\, Q'\big(s_{i+1}, \mu'(s_{i+1} \mid \theta^{\mu'}) \mid \theta^{Q'}\big), $$

where $a_i$ and $r_i$ are the sampled action and reward, $\gamma$ is the discount factor, and $Q'$ and $\mu'$ are the target Q-network and target policy network with parameters $\theta^{Q'}$ and $\theta^{\mu'}$.
The training data required by the DDPG algorithm include the current state, current action, reward or punishment value, and next state; these are stored in a replay buffer, sampled, and used for training to achieve optimization. The basic structure of the DDPG algorithm is depicted in Figure 2.
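As a concrete illustration of the update described above, the following minimal PyTorch sketch performs one critic and one actor update on a random stand-in mini-batch. It is not the authors' implementation: the network sizes, state and action dimensions, and hyperparameters are illustrative assumptions rather than the settings used in this paper.

```python
import torch
import torch.nn as nn

STATE_DIM, ACTION_DIM, BATCH, GAMMA, TAU = 8, 2, 64, 0.99, 0.005

def mlp(in_dim, out_dim):
    return nn.Sequential(nn.Linear(in_dim, 64), nn.ReLU(), nn.Linear(64, out_dim))

actor, critic = mlp(STATE_DIM, ACTION_DIM), mlp(STATE_DIM + ACTION_DIM, 1)
target_actor, target_critic = mlp(STATE_DIM, ACTION_DIM), mlp(STATE_DIM + ACTION_DIM, 1)
target_actor.load_state_dict(actor.state_dict())
target_critic.load_state_dict(critic.state_dict())
actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-3)
critic_opt = torch.optim.Adam(critic.parameters(), lr=1e-3)

# Random mini-batch standing in for samples drawn from the replay buffer.
s = torch.randn(BATCH, STATE_DIM)
a = torch.randn(BATCH, ACTION_DIM)
r = torch.randn(BATCH, 1)
s2 = torch.randn(BATCH, STATE_DIM)
done = torch.zeros(BATCH, 1)

# Critic update: minimize the squared TD error against the two target networks.
with torch.no_grad():
    y = r + GAMMA * (1 - done) * target_critic(torch.cat([s2, target_actor(s2)], dim=1))
critic_loss = nn.functional.mse_loss(critic(torch.cat([s, a], dim=1)), y)
critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()

# Actor update: maximize Q by minimizing its negative (the policy loss above).
actor_loss = -critic(torch.cat([s, actor(s)], dim=1)).mean()
actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()

# Soft update of the target networks.
for net, target in ((actor, target_actor), (critic, target_critic)):
    for p, tp in zip(net.parameters(), target.parameters()):
        tp.data.copy_(TAU * p.data + (1 - TAU) * tp.data)
```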
3. IMPROVING THE DDPG DRL ALGORITHM
In this paper, we improve and optimize the original DDPG algorithm. In this section, we focus on the state and action spaces, the design of the combined reward function, dynamic region restriction, the prioritized experience replay mechanism, the multi-intelligence agent scheme, and the introduction of Gaussian noise.
3.1. USV kinematics model
We present a schematic of the coordinate system of the USV in Figure 3. The second-order kinematic equations of the USV are then established as follows:

$$ \begin{cases} \dot{x} = v\cos\psi \\ \dot{y} = v\sin\psi \\ \dot{\psi} = \omega \\ \dot{v} = a \\ \dot{\omega} = \alpha, \end{cases} $$

where $(x, y)$ denotes the position of the USV in the planar coordinate system, $\psi$ is the heading angle, $v$ is the forward speed, $\omega$ is the angular velocity of the heading, $a$ is the forward acceleration, and $\alpha$ is the angular acceleration of steering.
In this USV kinematics model, we use second-order kinematic equations that consider only the speed, heading, angular velocity, acceleration, and angular acceleration; that is, only the forward speed of the USV, the range of turning to the left and right, the forward acceleration, and the acceleration of left and right steering are considered. The heading speed and the accelerations are bounded so that the model is consistent with the maneuvering capability of an actual USV.
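To illustrate how such a second-order kinematic model can be propagated in simulation, the following minimal sketch advances the USV state with a simple Euler step; the time step, speed limit, and turn-rate limit are illustrative assumptions, not the values used in this work.

```python
import math

def step_usv(state, accel, ang_accel, dt=0.1, v_max=2.0, w_max=0.5):
    """Advance the USV state one time step under the second-order kinematic model.

    state = (x, y, psi, v, w): position, heading, forward speed, angular velocity.
    accel / ang_accel are the control inputs (forward and angular acceleration).
    The speed and turn-rate limits are illustrative placeholders.
    """
    x, y, psi, v, w = state
    v = max(-v_max, min(v_max, v + accel * dt))      # update and clamp forward speed
    w = max(-w_max, min(w_max, w + ang_accel * dt))  # update and clamp yaw rate
    psi = (psi + w * dt) % (2 * math.pi)             # update heading
    x += v * math.cos(psi) * dt                      # update position
    y += v * math.sin(psi) * dt
    return (x, y, psi, v, w)

# Example: start at the origin heading east, apply constant thrust and a gentle turn.
s = (0.0, 0.0, 0.0, 0.0, 0.0)
for _ in range(50):
    s = step_usv(s, accel=0.2, ang_accel=0.05)
```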
3.2. Designing state and action spaces
3.2.1. State spaces
In the DRL process, the intelligent agent determines what action to take next based on the state information received from the environment. Therefore, a reasonable state space must be designed. The state space should contain the current position of the unmanned vessel, the positions of the obstacles, and the target position. We use a planar Cartesian coordinate system, that is, the X-axis and Y-axis, to represent the positions of the USV, obstacles, and target by coordinates, which indicate the distances between the current USV position and the obstacles and target point. The state space of environmental observation information is presented in Figure 4.
3.2.2. Action spaces
The DDPG algorithm is used to solve the continuous action space problem, and the action space must be designed considering the possible actions that the USV may take to avoid obstacles and reach the target point. The USV changes its heading by controlling the rudder angle to avoid obstacles and reach the target point. The heading is designed over a continuous range of rudder angles, and this range is restricted so that the action space is consistent with the actual maneuvering capability of the USV and excessively large turns are avoided.
3.3. Improved combined reward and penalty functions
After designing the USV state and action spaces, we design continuous reward and punishment functions based on the motion, position, obstacles, start point, and endpoint of the USV. The reward and punishment functions are used to evaluate the quality of the DRL decisions; they affect the convergence of the entire algorithm and play a decisive role in the whole process. The common DDPG reward and punishment function is sparse, assigning a fixed reward only when the USV reaches the target and a fixed penalty only when it collides with an obstacle:

$$ r = \begin{cases} r_{\text{goal}}, & \text{the USV reaches the target point} \\ r_{\text{obs}}, & \text{the USV hits an obstacle} \\ 0, & \text{otherwise,} \end{cases} $$

where $r_{\text{goal}} > 0$ is the terminal reward and $r_{\text{obs}} < 0$ is the collision penalty. Under this scheme, most steps receive no feedback, so the reward is sparse and convergence is slow. In the improved combined reward and punishment function, the USV is penalized when it touches an obstacle and rewarded when it reaches the target point; in addition, a continuous distance-based term rewards the USV for moving closer to the target, an angle-based term rewards headings aligned with the direction of the target point, and a small penalty is applied in all other cases, so that every action receives a corresponding reward or punishment. The components of the improved reward and punishment function are summarized in Table 1.
Reward and punishment functions
Current status of USVs | Reward |
Reaches the target point | Large positive terminal reward |
Hits an obstacle | Large negative penalty |
Distance to the target point | Continuous reward that increases as the distance decreases |
Angle between the forward direction and the target point | Continuous reward that increases as the heading aligns with the target direction |
Other | Small penalty for each step |
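The following minimal sketch shows one possible implementation of a combined continuous reward of the kind summarized above; the coefficients, thresholds, and safety radius are illustrative assumptions rather than the values used in the experiments.

```python
import math

# Illustrative coefficients; the actual values used in the paper are not reproduced here.
R_GOAL, R_COLLISION = 100.0, -100.0
K_DIST, K_ANGLE, STEP_PENALTY = 1.0, 0.5, -0.1

def combined_reward(pos, heading, prev_dist, goal, obstacles, safe_radius=1.0):
    """Continuous reward: terminal bonus/penalty plus dense distance and heading terms."""
    dist = math.hypot(goal[0] - pos[0], goal[1] - pos[1])
    # Terminal cases: reaching the target or hitting an obstacle.
    if dist < 0.5:
        return R_GOAL, dist
    if any(math.hypot(ox - pos[0], oy - pos[1]) < safe_radius for ox, oy in obstacles):
        return R_COLLISION, dist
    # Dense shaping: reward progress toward the goal ...
    r = K_DIST * (prev_dist - dist)
    # ... and alignment of the heading with the direction of the goal.
    goal_bearing = math.atan2(goal[1] - pos[1], goal[0] - pos[0])
    angle_err = abs((goal_bearing - heading + math.pi) % (2 * math.pi) - math.pi)
    r += K_ANGLE * (1.0 - angle_err / math.pi)
    # Small step penalty so that every action receives feedback.
    return r + STEP_PENALTY, dist
    # The returned distance is passed back as prev_dist on the next step.
```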
3.4. Dynamic area restrictions
On top of the original state-space design, we apply a dynamic region restriction. The dynamic region limiting mechanism allows us to ignore the influence of obstacles that are far from the USV: only the obstacles near the USV are observed, and the information of the distant obstacles is removed from the state space, so that the USV can explore step by step and adapt better to the complexity of the environment. State-space tailoring is performed to adjust the state space according to the training process and the changes in the environment around the current location of the USV. In this work, the USV is designed to observe the five nearest obstacles and ignore the others, which reduces the number of algorithm iterations and the consumption of computer resources, as illustrated in Figure 6.
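A minimal sketch of how this dynamic region restriction could be realized is given below: only the k nearest obstacles are kept in the observation and the rest are ignored. The choice of k = 5 follows the description above, whereas the exact state layout and the padding scheme for maps with fewer than k obstacles are illustrative assumptions.

```python
import numpy as np

def build_state(usv_pos, goal_pos, obstacles, k=5, pad_value=1e6):
    """Observation containing the USV position, the goal, and only the k nearest obstacles."""
    usv = np.asarray(usv_pos, dtype=float)
    obs = np.asarray(obstacles, dtype=float).reshape(-1, 2)
    # Sort obstacles by distance to the USV and keep the k closest ones.
    order = np.argsort(np.linalg.norm(obs - usv, axis=1))
    nearest = obs[order[:k]]
    # If fewer than k obstacles exist, pad with a large "far away" placeholder.
    if len(nearest) < k:
        pad = np.full((k - len(nearest), 2), pad_value)
        nearest = np.vstack([nearest, pad])
    return np.concatenate([usv, np.asarray(goal_pos, dtype=float), nearest.ravel()])

state = build_state((2.0, 3.0), (18.0, 17.0),
                    obstacles=[(5, 5), (9, 12), (14, 3), (1, 15), (7, 8), (16, 16)])
```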
3.5. Replay buffer
When the DDPG algorithm is used for path planning, after the designs described above are completed, the design of the experience pool must be considered; it largely determines the convergence speed of the algorithm and whether the planned path is shorter and smoother. The traditional replay buffer draws uniform random sampling batches, and consequently, certain experiences may never be sampled while others are sampled many times. The samples collected by the intelligent agents have different degrees of importance. To speed up the training of the intelligent agents and increase training efficiency, a prioritized experience replay mechanism is incorporated so that important experiences are replayed more often while every experience in the pool still has a chance of being sampled. The priority of each transition is determined by its temporal-difference (TD) error:

$$ p_i = \lvert \delta_i \rvert + \epsilon, $$

where $\delta_i$ is the TD error of transition $i$ and $\epsilon$ is a small positive constant that prevents transitions with a near-zero error from never being replayed. The probability of sampling transition $i$ is then

$$ P(i) = \frac{p_i^{\alpha}}{\sum_{k} p_k^{\alpha}}, $$

where $\alpha$ is the prioritization adjustment parameter that controls how strongly sampling concentrates on high-priority experiences ($\alpha = 0$ recovers uniform sampling). To correct the bias introduced by non-uniform sampling, each sampled transition is weighted by the importance-sampling weight $w_i = (N \cdot P(i))^{-\beta}$, where $N$ is the buffer size and $\beta$ is annealed toward 1 as training proceeds.
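The sampling rule above can be sketched as a simple proportional prioritized replay buffer, shown below without the sum-tree optimization used in large-scale implementations; the capacity and the values of α, β, and ε are illustrative assumptions.

```python
import numpy as np

class PrioritizedReplayBuffer:
    """Proportional prioritized experience replay (simplified, no sum tree)."""

    def __init__(self, capacity=10000, alpha=0.6, eps=1e-5):
        self.capacity, self.alpha, self.eps = capacity, alpha, eps
        self.data, self.priorities = [], []

    def add(self, transition):
        # New experiences get the current maximum priority so they are replayed soon.
        max_p = max(self.priorities, default=1.0)
        if len(self.data) >= self.capacity:
            self.data.pop(0); self.priorities.pop(0)
        self.data.append(transition)
        self.priorities.append(max_p)

    def sample(self, batch_size, beta=0.4):
        p = np.asarray(self.priorities) ** self.alpha
        probs = p / p.sum()                                 # P(i) = p_i^alpha / sum_k p_k^alpha
        idx = np.random.choice(len(self.data), batch_size, p=probs)
        weights = (len(self.data) * probs[idx]) ** (-beta)  # importance-sampling weights
        weights /= weights.max()
        return idx, [self.data[i] for i in idx], weights

    def update_priorities(self, idx, td_errors):
        for i, err in zip(idx, td_errors):
            self.priorities[i] = abs(err) + self.eps        # p_i = |delta_i| + eps
```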
3.6. Bidirectional search method
We use prioritized experience replay, and on this basis, we introduce the multi-intelligence agent idea. Two methods are commonly used for multi-intelligence experience replay. In the first, an independent experience replay buffer is set up for each intelligent agent, and each agent uses its own prioritized experience replay mechanism. In the second, the experiences of multiple intelligences are merged into a shared replay buffer, and prioritized experience replay is then applied to this buffer. In the mechanism introduced herein, dual intelligences and distributed prioritized replay are employed: the prioritized experience replay mechanism is applied in a distributed manner, and the experiences of the two intelligences are used to update the priorities together. The intelligences store their experiences in a shared buffer for sampled training and learn from the shared experiences. The shared experience buffer increases the diversity of experiences, provides more sample data, and accelerates the optimization of strategies in cooperative tasks, which helps increase training efficiency. Moreover, a shared buffer requires less maintenance than individual buffers, which keeps resource consumption close to that of a single intelligence. The two intelligent agents begin at the start point and the endpoint and move toward the endpoint and the start point, respectively, until they meet in the middle and finally find the optimal path, as illustrated in Figure 7.
3.7. Gaussian noise
In the DDPG algorithm, the policy network outputs a deterministic action value rather than a randomly sampled action. To add explorability to the DDPG algorithm, Gaussian noise is therefore added to the deterministic action. A random Gaussian noise process maintains a degree of exploration while preserving the continuity and smoothness of the actions.
The action value output by the policy network is added to random noise sampled from a Gaussian distribution, and the result is used as the final action output. The noise variance is reduced gradually during training, so that in the later stages the output of the policy network itself is relied upon. As the number of training steps increases, the noise variance decreases, as follows:

$$ a_t = \mu(s_t \mid \theta^{\mu}) + \mathcal{N}_t, \qquad \mathcal{N}_t \sim \mathcal{N}(0, \sigma_t^{2}), $$

where $a_t$ is the action executed at step $t$, $\mu(s_t \mid \theta^{\mu})$ is the deterministic output of the policy network, and $\sigma_t$ is the noise standard deviation at step $t$, which is decreased gradually as the number of training steps increases so that exploration dominates early in training and exploitation dominates later.
By introducing random Gaussian noise and gradually decreasing its variance, we can balance the exploratory and exploitative aspects and improve the performance and stability of the algorithm to some extent.
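A small sketch of this exploration scheme is shown below, assuming an exponentially decaying standard deviation with a lower bound; the initial value, decay rate, floor, and action bounds are placeholders rather than the values used in the experiments.

```python
import numpy as np

def noisy_action(policy_action, step, sigma0=0.3, sigma_min=0.02, decay=0.999,
                 act_low=-1.0, act_high=1.0):
    """Add zero-mean Gaussian noise whose standard deviation shrinks as training proceeds."""
    sigma = max(sigma_min, sigma0 * decay ** step)
    noise = np.random.normal(0.0, sigma, size=np.shape(policy_action))
    return np.clip(policy_action + noise, act_low, act_high)

# Early in training the action is strongly perturbed; late in training it is almost deterministic.
print(noisy_action(np.array([0.2, -0.5]), step=10))
print(noisy_action(np.array([0.2, -0.5]), step=50000))
```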
3.8. Improved DDPG algorithm pseudo-code
To aid understanding of the running process of the improved DDPG reinforcement learning algorithm, its pseudocode is provided in Algorithm 1, as follows:
Algorithm 1: Improved DDPG reinforcement learning algorithm
Initialization: the parameters of the Q-network and the policy network of each intelligence; the two corresponding target network parameters of each intelligence; and the experience replay pool Store, with its prioritization adjustment parameter.
3.9. Comparison of advantages and disadvantages of existing algorithms
In Table 2, the strengths and weaknesses of the existing DDPG algorithms described in the Introduction are presented concisely and clearly.
Comparison of the existing algorithms
Existing algorithms | Advantages | Disadvantages |
Liu et al.[16] | Combination of multi-intelligence and combined prioritized experience replay mechanisms | Requires coordination and cooperation among multiple intelligences; requires a large amount of space to store experiences |
Zhou et al.[17] | Multi-agent TD-MATD3 combination with a new reward function | Requires coordination and cooperation among multiple intelligences; requires a large amount of space to store experiences |
Wu et al.[18] | Artificial potential field method for designing reward and punishment functions; improved prioritized experience replay structure | Parameter settings are sensitive; the artificial potential field is not adjusted efficiently, is time-consuming, and can become trapped in a local optimum |
Chen et al.[19] | Reduces the overestimation of Q, improves the reward function, and introduces a warm-up phase | Relies on a warm-up phase that requires extensive tuning and increases training time; reducing the overestimation of Q can lead to local optimality in complex environments |
Feng et al.[20] | Introduces intrinsic motivation and HER; beluga whale optimization algorithm for hyperparameter optimization | The combination of intrinsic motivation and HER increases computation and can lead to over-exploration; HER requires suitable replay targets |
Yang et al.[21] | Combines DRL and the velocity obstacle method, introduces COLREGs, incorporates the VO algorithm into the reward function, uses fuzzy theory calculations, and constructs the state space with the NIF | The application of fuzzy theory introduces uncertainty into the decision-making process, and the influence factors increase the state space complexity |
Compared with the existing algorithms, the advantages of the algorithm proposed herein are as follows. A new continuous reward function is designed to address the problem of reward sparsity. A dynamic region restriction method is introduced to reduce the computational burden by allowing the intelligent agent to ignore faraway obstacles and retain only a few obstacles around itself. A dual-intelligence scheme combined with a prioritized experience replay mechanism is added; the dual-intelligence scheme avoids the heavy burden caused by too many intelligences while remaining more efficient than a single intelligent agent. The intelligent agents share experiences with each other, which reduces the training time substantially and increases the overall planning efficiency and learning speed. Finally, to ensure greater consistency with the actual USV sailing situation, the action space is set up such that large USV movements such as 180-degree turns are avoided, the COLREGs principle is introduced, and the planned path stays away from obstacles to improve safety and optimize the sailing route.
4. RESULTS
4.1. Modeling of the mission environment
We used Python to construct a 2D raster map representing the environment of the USV moving on the surface of water. The size of the raster map was set as
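As an illustration of this kind of raster-map environment, the following sketch builds a small occupancy grid with circular obstacles and checks whether a position collides with an obstacle or leaves the map; the grid size and obstacle layout here are arbitrary examples, not the actual experimental map.

```python
import numpy as np

def make_grid(size, obstacles):
    """Occupancy grid: 1 marks cells covered by a circular obstacle, 0 marks free water."""
    grid = np.zeros((size, size), dtype=np.int8)
    ys, xs = np.mgrid[0:size, 0:size]
    for cx, cy, radius in obstacles:
        grid[(xs - cx) ** 2 + (ys - cy) ** 2 <= radius ** 2] = 1
    return grid

def is_collision(grid, x, y):
    """True if the (x, y) position lies outside the map or inside an obstacle cell."""
    i, j = int(round(y)), int(round(x))
    if not (0 <= i < grid.shape[0] and 0 <= j < grid.shape[1]):
        return True
    return bool(grid[i, j])

grid = make_grid(20, obstacles=[(5, 5, 2), (12, 8, 1), (15, 15, 2)])
print(is_collision(grid, 5, 5), is_collision(grid, 0, 0))
```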
4.2. Parameter setting
A few specific parameter settings of the improved DDPG algorithm proposed herein are presented in Table 3.
Parameter setting
Parameters | Value |
0.001 | |
0.001 | |
0.99 | |
0.005 | |
10 | |
0.1 | |
512 | |
512 | |
300 | |
50 | |
0.3 | |
0.5 | |
5 |
4.3. Algorithm comparison
The start point is
As illustrated in Figure 9, from the path planning perspective, the original DDPG algorithm and the improved DDPG algorithm proposed herein were able to reach the destination from the starting point. The proposed improved DDPG algorithm planned a smoother path than that planned by the original DDPG algorithm; the planned paths were shorter and did not exhibit excessive jitter.
Figure 9. Paths planned using original and improved DDPG algorithms. DDPG: Deep deterministic policy gradient.
Meanwhile, depending on the actual situation, the reward functions of the action space and continuity were changed such that the planned path was far away from obstacles and smoother without excessive steering. The steering angle and forward distance of the final path planned using the improved DDPG algorithm are illustrated in Figure 10, where the steering angle is not excessively jittery and is stable, and the forward distance is maintained at
Figure 11 depicts the number of coordinate points of the X- and Y-axes of the path planned using the improved DDPG algorithm.
As illustrated in Figure 12, herein, we use dual intelligences combined with a prioritized experience replay mechanism. The proposed improved DDPG algorithm can reach relatively stable values within a shorter time in the early stages of training compared to the original DDPG algorithm, and the combined rewards for each episode in the early stage of training are higher than those of the original algorithm. The improved DDPG algorithm tends to reach higher values at around
In the same environment, we apply the DQN algorithm; the results are presented in Figure 13, along with a comparison of the paths planned by the three algorithms. After training, the path planned by the DQN algorithm from the starting point in the current environment exhibits excessive jitter and does not reach the target point. The convergences of the DQN, DDPG, and proposed improved DDPG algorithms are illustrated in Figure 14.
The improved DDPG algorithm was compared with the original DDPG and DQN algorithms. The results indicated that the DQN algorithm planned an incomplete path and did not converge. The proposed improved DDPG algorithm changed the state and action spaces and redesigned the continuous reward and punishment functions such that the ranges of steering and advancement were narrowed. This restricted excessive fluctuation of the USV trajectory to a certain extent and made the planned path smoother and shorter, as illustrated in Figure 13. Simultaneously, the multi-intelligence prioritized experience replay mechanism was used, with one agent searching from the starting point and the other from the endpoint until they met in the middle, which reduced the planning distance to a certain extent. In addition, the dynamic area restriction was used to observe the five obstacles in the vicinity while ignoring the other obstacles, thereby reducing the number of algorithm iterations, saving computer resources, and accelerating the convergence of the algorithm, as depicted in Figure 14. The path planned by the improved DDPG algorithm was 29.15 in length, compared with 34.05 for the original DDPG, as summarized in Table 4.
Comparison of traditional and proposed DDPG algorithms
Algorithm | Path length | Comprehensive reward |
DQN | Unfinished | Not converged |
DDPG | 34.05 | 900 |
Improved DDPG | 29.15 | 670 |
The parameters
Figure 15. Comparison of paths planned using the improved DDPG with parameter changes. DDPG: Deep deterministic policy gradient.
The length of the planned path was
Figure 16. Convergences of improved DDPG algorithm and the algorithm with changed parameters. DDPG: Deep deterministic policy gradient.
Based on the original environment, the starting and endpoints were changed to
To further verify the effectiveness of the proposed improved algorithm, based on the initial environment, we added
Figure 18. Improving DDPG algorithm path in complex environments. DDPG: Deep deterministic policy gradient.
In the above experiments, we increased only the number and type of obstacles and changed the starting and endpoints. To explore further, we increased the map size to
Figure 19. Paths obtained by applying improved DDPG algorithm to different map sizes. DDPG: Deep deterministic policy gradient.
The results of the above experiments indicate that the proposed improved DDPG algorithm performed path planning effectively. To further validate this result, we used the method proposed by Liu et al., which mainly adopts MADDPG combined with a prioritized experience replay mechanism and makes a small change to the prioritized replay mechanism to redefine the priority ordering[16]. We extracted only this method (CPER-MADDPG) and applied it to USV path planning in this work. The method proposed by Liu et al. has some similarities to the improved method proposed herein, and therefore, it was selected for comparison[16].
Because only the method was extracted, to verify whether the method could plan the path in the environment considered herein, the experimental environment was changed to a simpler
Figure 20. Comparison of the paths planned using the improved DDPG algorithm and the CPER-MADDPG algorithm. DDPG: Deep deterministic policy gradient; CPER-MADDPG: combined prioritized experience replay with multi-intelligent deep deterministic policy gradient.
The results of the above experiments verified the feasibility of the proposed method, and therefore, the environment was changed back to the initial environment for comparison; the results are presented in Figure 21. The improved DDPG algorithm planned a smoother path with less jitter and fewer sharp corners than that planned by the CPER-MADDPG algorithm. The lengths of the paths planned by the improved DDPG algorithm and the CPER-MADDPG algorithm were
Figure 21. Comparison of paths planned using the proposed improved DDPG algorithm and CPER-MADDPG for different maps. DDPG: Deep deterministic policy gradient; CPER-MADDPG: combined prioritized experience replay with multi-intelligent deep deterministic policy gradient.
Figure 22. Convergences of improved DDPG and CPER-MADDPG algorithms. DDPG: Deep deterministic policy gradient; CPER-MADDPG: combined prioritized experience replay with multi-intelligent deep deterministic policy gradient.
In summary, we compared the proposed algorithm with the traditional DDPG and DQN algorithms. By changing the parameter settings of the proposed improved DDPG algorithm for comparison, increasing the number of irregular triangular obstacles in the original environment, using different map sizes, and extracting the method proposed in [16] and applying it to the experimental environment of this paper for comparison, we verified the feasibility of the proposed improved DDPG algorithm.
5. CONCLUSIONS
In this paper, an improved DDPG algorithm was proposed for USV path planning. Based on the DDPG algorithm, the continuous reward function was redesigned, and the COLREGs principle was introduced to solve the reward sparsity problem. Dynamic region restriction was added to reduce the number of algorithm iterations and save computational resources. The multi-intelligence bidirectional search method and the distributed prioritized replay mechanism were introduced to accelerate algorithm convergence. The experiments demonstrated that the improved DDPG algorithm converged faster and that the paths planned using it were smoother and shorter, thereby verifying the feasibility of the improved algorithm. The improved DDPG algorithm increased the efficiency and accuracy of path planning, as well as the safety of USV navigation, by introducing COLREGs, which helps USVs navigate complex environments. However, the dynamics of the setup are limited: the state space still does not fully reflect the real sailing situation. In the future, the performance of the proposed algorithm should be explored in different environments by considering more real-time factors such as wind, waves, and weather conditions. Additionally, the algorithm could be extended to other domains, such as financial trading, to optimize the timing of automated buying and selling, and healthcare, to dynamically adjust drug dosages. Furthermore, deep learning techniques could be incorporated to further optimize the continuous reward function and the dynamic region restriction.
DECLARATIONS
Authors' contributions
Carried out the numerical simulation, wrote the improved DDPG code, and wrote the draft: Hua M
Guided the work, and proofread the draft: Zhou W
Helped collect and process the data: Cheng H, Chen Z
Availability of data and materials
Not applicable.
Financial support and sponsorship
This work was supported by the National Natural Science Foundation of China (No. 62203293). It was also sponsored by the Shanghai Rising-Star Program (No. 22YF1416100).
Conflicts of interest
Zhou W is the Guest Editor Assistant of the Special Issue of "Intelligent Control and Decision-making of Unmanned Ships" of the journal Intelligence & Robotics, while the other authors have declared that they have no conflicts of interest.
Ethical approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Copyright
© The Author(s) 2024.
REFERENCES
1. Tabish N, Chaur-Luh T. Maritime autonomous surface ships: a review of cybersecurity challenges, countermeasures, and future perspectives. IEEE Access 2024;12:17114-36.
2. Sun Q, Liu ZW, Chi M, Ge MF, He D. Robust coverage control of multiple USVs with time-varying disturbances. Intell Robot 2023;3:242-56.
3. Liang Y. Route planning of aids to navigation inspection based on intelligent unmanned ship. In: 2021 4th International Symposium on Traffic Transportation and Civil Architecture (ISTTCA); 2021 Nov 12-14; Suzhou, China. IEEE; 2021. pp. 95-8.
4. Li H, Yang Z. Incorporation of AIS data-based machine learning into unsupervised route planning for maritime autonomous surface ships. Transport Res E Log 2023;176:103171.
5. Su M, Pu R, Wang Y, Yu M. A collaborative siege method of multiple unmanned vehicles based on reinforcement learning. Intell Robot 2024;4:39-60.
6. Feng S, Li X, Ren L, Xu S. Reinforcement learning with parameterized action space and sparse reward for UAV navigation. Intell Robot 2023;3:161-75.
7. Yu D, Roh MI. Method for anti-collision path planning using velocity obstacle and A* algorithms for maritime autonomous surface ship. Int J Nav Arch Ocean 2024;16:100586.
8. Zhang B, Cao J, Li X, et al. Minimum dose walking path planning in a nuclear radiation environment based on a modified A* algorithm. Ann Nucl Energy 2024;206:110629.
9. Yu J, Yang M, Zhao Z, et al. Path planning of unmanned surface vessel in an unknown environment based on improved D*Lite algorithm. Ocean Eng 2022;266:112873.
10. Jin J, Zhang Y, Zhou Z, Jin M, Yang X, Hu F. Conflict-based search with D* lite algorithm for robot path planning in unknown dynamic environments. Comput Electr Eng 2023;105:108473.
11. Zhang J, Zhu Z, Xue Y, Deng Z, Qin H. Local path planning of under-actuated AUV based on VADWA considering dynamic model. Ocean Eng 2024;310:118705.
12. Cao Y, Nor NM. An improved dynamic window approach algorithm for dynamic obstacle avoidance in mobile robot formation. Decis Anal 2024;11:100471.
13. Zhou Q, Lian Y, Wu J, Zhu M, Wang H, Cao J. An optimized Q-Learning algorithm for mobile robot local path planning. Knowl Based Syst 2024;286:111400.
14. Wang Y, Lu C, Wu P, Zhang X. Path planning for unmanned surface vehicle based on improved Q-Learning algorithm. Ocean Eng 2024;292:116510.
15. Yu Y, Liu Y, Wang J, Noguchi N, He Y. Obstacle avoidance method based on double DQN for agricultural robots. Comput Electron Agr 2023;204:107546.
16. Liu P, Ma X, Ding J, Gu C. Multi-agent collaborative path planning algorithm with reinforcement learning and combined prioritized experience replay in Internet of Things. Comput Electr Eng 2024;116:109193.
17. Zhou Y, Kong X, Lin KP, Liu L. Novel task decomposed multi-agent twin delayed deep deterministic policy gradient algorithm for multi-UAV autonomous path planning. Knowl Based Syst 2024;287:111462.
18. Wu X, Zhao Z, Ge Q, Luo Z. Unmanned aerial vehicle path planning based on DP-DDPG algorithm. In: 2023 IEEE International Conference on Unmanned Systems (ICUS); 2023 Oct 13-15; Hefei, China. IEEE; 2023. pp. 251–6.
19. Chen Q, Yang J, Mao J, Liang Z, Lu C, Sun P. A path following controller for deep-sea mining vehicles considering slip control and random resistance based on improved deep deterministic policy gradient. Ocean Eng 2023;278:114069.
20. Feng Z, Wang C, An J, et al. Emergency fire escape path planning model based on improved DDPG algorithm. J Build Eng 2024;95:110090.