A distributed multi-vehicle pursuit scheme: generative multi-adversarial reinforcement learning
Abstract
Multi-vehicle pursuit (MVP) is one of the most challenging problems for intelligent traffic management systems due to multi-source heterogeneous data and its mission nature. While many reinforcement learning (RL) algorithms have shown promising abilities for MVP in structured grid-pattern roads, their lack of dynamic and effective traffic awareness limits pursuing efficiency. The sparse reward of pursuing tasks further hinders the optimization of these RL algorithms. Therefore, this paper proposes a distributed generative multi-adversarial RL for MVP (DGMARL-MVP) in urban traffic scenes. In DGMARL-MVP, a generative multi-adversarial network is designed to improve the Bellman equation by generating the potential dense reward, thereby properly guiding strategy optimization of distributed multi-agent RL. Moreover, a graph neural network-based intersecting cognition is proposed to extract integrated features of traffic situations and relationships among agents from multi-source heterogeneous data. These integrated and comprehensive traffic features are used to assist RL decision-making and improve pursuing efficiency. Extensive experimental results show that DGMARL-MVP can reduce the pursuit time by 5.47% compared with proximal policy optimization and improve the average pursuing success rate up to 85.67%. The code is open-sourced on GitHub.
1. INTRODUCTION
Enabled by novel sensing technology[1] and the self-learning ability of reinforcement learning (RL)[2], the intelligent traffic management system is enjoying a significant upgrade and showing great potential to solve various problems in intelligent transportation systems (ITS)[3]. As a complex special scene, multi-vehicle pursuit (MVP) describes the problem of multiple vehicles capturing several moving targets[4], represented by the New York City Police Department guideline on the pursuit of suspicious vehicles[5]. Moreover, various military intelligence combat scenes can also be modeled as MVP[6]. Effective reward guidance[7] and comprehensive perception[8] of complex and dynamic urban traffic environments are the keys to solving the MVP problem and are gradually becoming hot topics.
Aiming at the MVP problem, Garcia et al. extended classical differential game theory and devised saddle-point strategies[9] to address multi-player pursuit-evasion problems. Xu et al. considered greedy, lazy, and traitorous pursuers during the pursuit and rigorously re-analyzed Nash equilibrium[10]. A graph-theoretic approach[11] was employed to study the interactions of the agents and obtain distributed control policies for pursuers. A region-based relay pursuit scheme[12] was designed for the pursuers to capture one evader. Jia et al. proposed a policy iteration method-based continuous-time Markov decision process (MDP)[13] to optimize the pursuer strategy. However, these classical methods for MVP are not competent for complex traffic scenes with more constraints due to poor robustness. De Souza et al. introduced distributed multi-agent RL and curriculum learning to MVP problems[14]. To improve pursuing efficiency, Zhang et al. constructed a multi-agent coronal bidirectionally coordinated with a target prediction network[15] based on the multi-agent deep deterministic policy gradient. For efficient cooperation among pursuers, Yang et al. designed a hierarchical collaborative framework[16]. Zheng et al. extended multi-to-multi competition to air combat among unmanned aerial vehicles[17]. However, due to the mission nature of MVP, the pursuers only obtain a sparse reward after successfully capturing an evader. None of the aforementioned RL-based methods have addressed the sparse reward problem. This issue blurs the direction of the gradient descent of neural networks and seriously affects the strategy optimization. In addition, the lack of dynamic and effective awareness in the above MVP methods limits pursuing efficiency.
Due to powerful capabilities of distribution feature extraction and data generation, generative adversarial networks (GANs) have drawn growing interest in recent years[18] and have been combined with RL to optimize strategies. To address the problem of incomplete observation of traffic information, Wang et al. used GANs for traffic data recovery to assist in deep RL (DRL) decision-making[19]. A GAN-assisted human preference-based RL approach[20] was proposed that adopted a GAN to learn human preferences. Li et al. designed a conditional deep generative model to predict future trajectory distribution[21]. The adversarial training of GANs was introduced into the policy network and critic network[22] to optimize RL training. Zheng et al. developed a reward-reinforced GAN[23] to represent the distribution of the value function. However, mission-critical requirements of MVP pose significant challenges to these methods. The problem of the sparse reward remains unsolved, hindering the RL optimization.
Graph neural networks (GNNs) have an excellent ability to handle unstructured data and are widely applied to modeling multi-agent interactions and feature extraction of traffic information. Liu et al. modeled the relationship between agents by a complete graph[24] to indicate the importance of the interaction between two agents. For cooperation among heterogeneous agents, Du et al. proposed a heterogeneous graph attention network[25] to model the relationships among these diverse agents. GNNs were employed to model vehicle relationships and extract traffic features to enhance autonomous driving[26,27]. A GNN with spatial-temporal clustering[28] was designed for traffic flow forecasting. However, the single-layer GNN structure in the above methods did not couple the interaction model and traffic information of agents, which affects the RL collaborative game decision-making in complex urban traffic scenes.
In summary, as for the existing approaches for MVP, sparse reward and the lack of comprehensive traffic cognition severely limit the collaboratively pursuing efficiency. To address these problems, this paper proposes distributed generative multi-adversarial RL for MVP (DGMARL-MVP) in urban traffic scenes, as shown in Figure 1. Firstly, a generative multi-adversarial network (GMAN) is designed to guide RL strategy optimization via generating dense rewards, replacing the approximation of Bellman updates. The generative multi-adversarial RL can be applied to a wide range of multi-agent systems with sparse rewards to improve task-related performance. Moreover, a proposed GNN-based intersecting cognition promotes deep coupling of traffic information and multi-agent interaction features. The contributions of this paper are summarized as follows.
Figure 1. Architecture of DGMARL-MVP. Urban traffic environments for MVP (A) provide complex pursuit-evasion scenes and interactive environments for RL. Every pursuing vehicle targets the nearest evading vehicle and launches a collaborative pursuit. GNN-based intersecting cognition (B) couples the traffic information and multi-agent interaction features to assist GMAN boosting reinforcement learning (C) in decision-making. The GMAN (D) guides RL strategy optimization by generating dense rewards, replacing the approximation of Bellman updates. MVP: Multi-vehicle pursuit; GNN: graph neural network; GMAN: generative multi-adversarial network.
● This paper proposes DGMARL-MVP in urban traffic scenes. In DGMARL-MVP, a GMAN is designed to improve the Bellman equation by generating the potential dense reward, thereby properly guiding strategy optimization of distributed multi-agent RL (MARL).
● GNN-based intersecting cognition is proposed to promote deep coupling of traffic information and multi-agent interaction features to assist in improving the pursuing efficiency.
● This paper applies DGMARL-MVP to simulated urban roads with 16 junctions and sets different pursuing difficulty levels with variable numbers of pursuing vehicles and evading vehicles. In the three tested difficulty levels, DGMARL-MVP reduces the pursuit time by 5.47% compared with proximal policy optimization and improves the average pursuing success rate up to 85.67%.
The rest of this paper is organized as follows. Section 2 describes MVP in an urban traffic scene and models the MVP problem based on the MDP. Section 3 presents generative multi-adversarial RL (GMARL) and its training process. Section 4 presents distributed GMARL with GNN-based intersecting cognition for MVP. Section 5 gives the performance of the proposed method. Section 6 draws conclusions.
2. MULTI-VEHICLE PURSUIT IN DYNAMIC URBAN TRAFFIC
This section first introduces the details of the complex urban traffic environment for the MVP problem. Then, the modeling process of the MVP problem is stated as an MDP, and the basic Q-learning algorithm focusing on the update process is introduced.
2.1. Complex urban traffic environment for MVP
This paper focuses on the problem of MVP under complex urban traffic and constructs a multi-intersection traffic scene. Each road is set as a bidirectional two-lane road, and fixed-phase traffic lights are placed at each intersection. In this scene, multiple pursuing vehicles collaboratively chase several evading vehicles while background vehicles travel on the road network.
Furthermore, the following constraints are set in the MVP environment: (1) All vehicles obey the traffic rules for collision-free driving; (2) The maximum speed, acceleration, and deceleration of all vehicles are bounded by the values given in the simulation settings.
2.2. MDP-based MVP problem formulation
In this paper, the decision-making of each pursuing vehicle depends only on the current state, so the decision process can be modeled as an MDP defined by a tuple $(\mathcal{S}, \mathcal{A}, \mathcal{P}, \mathcal{R}, \gamma)$, where $\mathcal{S}$ is the state space, $\mathcal{A}$ is the action space, $\mathcal{P}$ is the state transition probability, $\mathcal{R}$ is the reward function, and $\gamma$ is the discount factor.
RL provides an excellent solution to MDP games. As an advanced RL algorithm for problems with discrete action spaces, Q-learning enables decision-making without the exact state transition probability and initial state. For a Q-learning-based agent, the expectation values $Q(s, a)$ estimate the expected cumulative discounted reward of taking action $a$ in state $s$.
The updating process of Q-learning can be expressed as
$$Q(s,a) \leftarrow Q(s,a) + \alpha \left[ r + \gamma \max_{a'} Q(s',a') - Q(s,a) \right],$$
where $\alpha$ is the learning rate, $r$ is the immediate reward obtained after taking action $a$ in state $s$, $s'$ is the next state, and $\gamma \max_{a'} Q(s',a')$ approximates the discounted future reward.
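To make the update rule concrete, the following minimal sketch implements the standard tabular Q-learning update stated above; the table sizes, learning rate, and discount factor are illustrative assumptions rather than the paper's settings.

```python
import numpy as np

# Toy tabular Q-learning update (illustrative; state/action sizes are assumptions).
n_states, n_actions = 10, 3          # e.g., 3 actions: turn left, turn right, go straight
Q = np.zeros((n_states, n_actions))
alpha, gamma = 0.1, 0.9              # learning rate and discount factor (assumed values)

def q_learning_update(s, a, r, s_next, done):
    """One Bellman update: Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    target = r if done else r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (target - Q[s, a])

# Example transition: in state 2, going straight (action 2) yields reward 0 and leads to state 3.
q_learning_update(s=2, a=2, r=0.0, s_next=3, done=False)
```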
3. GENERATIVE MULTI-ADVERSARIAL REINFORCEMENT LEARNING
In order to effectively solve the reward sparsity problem of MVP, a GMAN is introduced to improve the Bellman equation by generating suitable potential dense rewards during RL optimization. Section 3.1 introduces the GMAN designed to generate dense rewards, and Section 3.2 presents how the generated rewards are embedded into the Bellman equation to boost RL training.
3.1. Generative multi-adversarial network for dense reward
As a special game task, the reward of MVP is extremely sparse. Only when a pursuing vehicle captures an evading vehicle can the RL-based agent obtain a reward. The sparse reward blurs the optimization direction of RL, thus seriously hindering the strategy update. In this paper, a conditional generative network is therefore designed to generate a potential dense reward conditioned on the agent state, and multiple discriminators are trained adversarially to evaluate the generated reward.
Suppose the state of the agent at the current time step is taken as the condition of the generative network.
The optimization objective of the generative network is to learn the contribution of the cumulative rewards and to generate a dense reward that reflects the potential future return of the current decision.
In GMAN, the generating network is trained against multiple discriminators, each of which learns to distinguish the generated rewards from the rewards computed with historical experience.
In practice, training against a far superior discriminator can impede the learning of the generator. To solve this problem and increase the stability of the generator, a classical Pythagorean mean is chosen as the fusion function to aggregate the outputs of the multiple discriminators into a single adversarial signal for the generator.
Then, the discriminators are updated to distinguish the generated rewards from the cumulative rewards computed from the experience replay.
Therefore, by training with historical experience replay, a GMAN is able to generate potential future rewards that densify the sparse reward signal of MVP.
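To make the structure of the GMAN concrete, the following PyTorch sketch shows one way to pair a conditional reward generator with several discriminators whose scores are fused by a Pythagorean mean; the layer sizes, the geometric-mean choice of fusion, and the loss form are illustrative assumptions rather than the authors' exact design.

```python
import torch
import torch.nn as nn

class RewardGenerator(nn.Module):
    """Conditional generator: maps an agent state to a potential dense reward."""
    def __init__(self, state_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 1))
    def forward(self, state):
        return self.net(state)

class RewardDiscriminator(nn.Module):
    """Scores how plausible a reward is; several copies form the multi-adversarial part."""
    def __init__(self, hidden=128):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(1, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 1), nn.Sigmoid())
    def forward(self, reward):
        return self.net(reward)

def fuse_scores(scores, eps=1e-8):
    """Pythagorean-mean fusion of I discriminator outputs (geometric mean assumed here)."""
    stacked = torch.stack(scores, dim=0).clamp_min(eps)
    return torch.exp(torch.log(stacked).mean(dim=0))

# Example: one generator against I = 3 discriminators.
state_dim, I = 16, 3
G = RewardGenerator(state_dim)
Ds = [RewardDiscriminator() for _ in range(I)]
state = torch.randn(8, state_dim)                     # batch of 8 states
fake_reward = G(state)
fused = fuse_scores([D(fake_reward) for D in Ds])     # signal used to train the generator
gen_loss = -torch.log(fused.clamp_min(1e-8)).mean()   # non-saturating GAN-style generator loss
```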
3.2. GMAN boosting reinforcement learning
The sparsity of rewards is a great challenge for the optimization of RL. In MVP, the RL-based agent explores many steps to obtain only one positive or negative reward, which leads to a vague direction of gradient descent for the agent. Therefore, this paper proposes a novel GMAN boosting RL. GMAN boosting RL generates reasonably dense rewards by virtue of the powerful generative capability of the generative network. The generated reward also encodes the potential future benefit of the RL decision, which improves the learning efficiency and the decision foresight of RL.
In GMAN boosting RL, the Bellman equation is modified using the generated reward: the approximation of the future reward, i.e., the max-operator bootstrap term in the Q-learning update, is replaced by the potential dense reward produced by the GMAN.
In this paper, the deep neural network is adopted to fit the Q function, and its parameters are updated by minimizing the temporal-difference loss derived from this modified Bellman equation.
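As a minimal sketch of the modified target, the snippet below contrasts the standard Q-learning target with a GMAN-boosted target in which the bootstrap term is replaced by the generated reward; the tensor shapes and the exact way the generated reward enters the target are assumptions.

```python
import torch

def standard_td_target(r, q_next, gamma=0.9, done=None):
    """Standard target: r + gamma * max_a' Q(s', a')."""
    bootstrap = q_next.max(dim=1).values
    if done is not None:
        bootstrap = bootstrap * (1.0 - done)
    return r + gamma * bootstrap

def gman_td_target(r, generated_reward, gamma=0.9):
    """GMAN-boosted target (assumed form): the approximated future return
    is replaced by the dense reward produced by the generator."""
    return r + gamma * generated_reward.squeeze(-1)

# Example shapes: a batch of 4 transitions with 3 actions.
r = torch.zeros(4)
q_next = torch.randn(4, 3)
gen_r = torch.randn(4, 1)
y_standard = standard_td_target(r, q_next)
y_gman = gman_td_target(r, gen_r)
```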
Distributed on-policy training is adopted in the proposed GMAN boosting RL. The overall training process is shown in Algorithm 1. For every RL-based agent, the experience collected by the current policy is stored in the replay buffer and sampled to update the Q network, the generator, and the discriminators.
Algorithm 1: Training Process of GMAN Boosting Reinforcement Learning. Input: RL-based agent Q, generator G, I discriminators, and the experience replay buffer.
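Since the body of Algorithm 1 is not reproduced above, the loop below is only a rough sketch of the described procedure (collect on-policy experience into a replay buffer, update the discriminators and the generator, then update the Q network with the generated reward); interfaces such as env.reset, agent.act, agent.q_net, and the optimizer dictionary are assumptions, not the authors' code.

```python
import random
import torch
import torch.nn.functional as F

def train_gman_boosting_rl(env, agent, G, Ds, optimizers, episodes=100,
                           batch_size=64, gamma=0.9):
    """Sketch of the training loop of Algorithm 1 under assumed interfaces."""
    buffer = []                                              # experience replay buffer
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:                                      # collect on-policy experience
            a = agent.act(s)
            s_next, r, done = env.step(a)
            buffer.append((s, a, r, s_next, float(done)))
            s = s_next

        if len(buffer) < batch_size:
            continue
        batch = random.sample(buffer, batch_size)
        s_b, a_b, r_b, s2_b, d_b = (torch.as_tensor(x, dtype=torch.float32)
                                    for x in zip(*batch))

        # 1) Discriminator step: distinguish observed returns from generated rewards.
        fake = G(s2_b)
        real = r_b.unsqueeze(-1)                             # stand-in for the observed return
        for D, opt_d in zip(Ds, optimizers["D"]):
            d_loss = F.binary_cross_entropy(D(real), torch.ones_like(real)) \
                   + F.binary_cross_entropy(D(fake.detach()), torch.zeros_like(real))
            opt_d.zero_grad(); d_loss.backward(); opt_d.step()

        # 2) Generator step: fool every discriminator (scores are fused in practice).
        g_loss = sum(F.binary_cross_entropy(D(fake), torch.ones_like(real)) for D in Ds)
        optimizers["G"].zero_grad(); g_loss.backward(); optimizers["G"].step()

        # 3) Q step: the GMAN-boosted Bellman target replaces the max-bootstrap term.
        q = agent.q_net(s_b).gather(1, a_b.long().unsqueeze(1)).squeeze(1)
        target = r_b + gamma * G(s2_b).squeeze(-1).detach() * (1.0 - d_b)
        q_loss = F.mse_loss(q, target)
        optimizers["Q"].zero_grad(); q_loss.backward(); optimizers["Q"].step()
```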
4. DISTRIBUTED GMARL WITH GNN-BASED COGNITION FOR MVP
To enhance comprehensive cognition of complex urban traffic in MVP, a novel double-layer intersecting GNN is proposed to couple the traffic information and multi-agent interaction features. Section 4.1 details the GNN-based intersecting cognition, Section 4.2 describes the distributed GMARL for MVP, and Section 4.3 presents the overall decision-making and training process of DGMARL-MVP.
4.1. GNN-based intersecting cognition
In this paper, a double-layer intersecting graph network is used, with a road graph to perceive the traffic condition and a vehicle graph to extract efficient information for pursuing vehicles, as shown in Figure 2. The main idea of intersecting lies in using the perceived traffic information to construct the vehicle graph, which enables a deep coupling of road information with vehicle information.
Each lane is modeled as a node in the first road graph, and the topological relationship of the roads is regarded as the edges of the graph. More formally, the constructed road graph consists of a node set of lanes and an edge set describing their topological connections.
The first road graph network consists of fully connected layers (FC) and graph aggregation operations, which fuse the features of neighboring lane nodes so that the traffic condition around each lane is perceived.
The vehicle graph takes the pursuing and evading vehicles as nodes, and its edges are constructed using the traffic features perceived by the road graph, which deeply couples the road information with the vehicle information and provides each pursuing vehicle with integrated cognition features.
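A minimal sketch of the double-layer idea is given below, assuming simple mean-aggregation graph-convolution layers: the road graph over lanes first produces traffic features, which are then concatenated into the vehicle graph so that road and vehicle information are coupled; the adjacency construction and feature dimensions are illustrative assumptions.

```python
import torch
import torch.nn as nn

class GraphConvLayer(nn.Module):
    """One simple graph convolution: mean-aggregate neighbor features, then a linear map."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.linear = nn.Linear(in_dim, out_dim)
    def forward(self, x, adj):
        deg = adj.sum(dim=1, keepdim=True).clamp_min(1.0)
        return torch.relu(self.linear(adj @ x / deg))

class IntersectingCognition(nn.Module):
    """Road graph (lanes) feeds the vehicle graph (pursuers/evaders): a two-layer coupling sketch."""
    def __init__(self, lane_dim, veh_dim, hidden=32):
        super().__init__()
        self.road_gnn = GraphConvLayer(lane_dim, hidden)
        self.vehicle_gnn = GraphConvLayer(veh_dim + hidden, hidden)

    def forward(self, lane_feats, road_adj, veh_feats, veh_lane_index, veh_adj):
        traffic = self.road_gnn(lane_feats, road_adj)          # per-lane traffic features
        coupled = torch.cat([veh_feats, traffic[veh_lane_index]], dim=1)
        return self.vehicle_gnn(coupled, veh_adj)              # per-vehicle coupled features

# Example with assumed sizes: 48 lanes, 6 vehicles.
model = IntersectingCognition(lane_dim=4, veh_dim=5)
lane_feats = torch.randn(48, 4); road_adj = torch.eye(48)
veh_feats = torch.randn(6, 5); veh_adj = torch.ones(6, 6)
veh_lane_index = torch.randint(0, 48, (6,))                    # which lane each vehicle occupies
out = model(lane_feats, road_adj, veh_feats, veh_lane_index, veh_adj)  # shape (6, 32)
```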
4.2. Distributed GMARL for MVP
In this paper, a deep neural network is adopted to fit the Q function of each pursuing vehicle.
Each pursuing vehicle makes decisions in a distributed manner based on its own observations and the shared information. For each pursuing vehicle, its local state is constructed from its own observation and the position information shared by the other pursuing vehicles.
Due to the constraints of the traffic scene, the action space of the pursuing vehicles contains three elements, i.e., turning left, turning right, and going straight at the next intersection. The expectation values of the three actions are estimated by the deep Q network, and each pursuing vehicle selects the action with the largest value at the next intersection.
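For concreteness, an epsilon-greedy selection over the three maneuvers might look as follows; the exploration scheme and the Q-network interface are assumptions rather than the authors' exact implementation.

```python
import random
import torch

ACTIONS = ["turn_left", "turn_right", "go_straight"]  # discrete maneuvers at the next intersection

def select_action(q_net, state, epsilon=0.05):
    """Epsilon-greedy choice among the three maneuvers (assumed exploration scheme)."""
    if random.random() < epsilon:
        return random.randrange(len(ACTIONS))
    with torch.no_grad():
        q_values = q_net(state.unsqueeze(0))           # shape (1, 3)
    return int(q_values.argmax(dim=1).item())
```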
To motivate the capture by pursuing vehicles and incentivize efficient training, an elaborately designed reward is adopted, which consists of the following two parts.
1. Only the pursuing vehicle that successfully captures an evading vehicle obtains the positive capture reward.
2. A distance-sensitive reward is set to improve the pursuing efficiency. When a pursuing vehicle reduces the distance from the closest evading vehicle compared to that at the last time step, it will obtain a positive reward, and conversely, it will be punished with a negative reward.
Therefore, the reward of each pursuing vehicle is formulated as the combination of the sparse capture reward and the distance-sensitive reward described above.
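Because the exact formula is not reproduced above, the following function is only a sketch of a reward with the two listed components, a sparse capture bonus for the capturing pursuer and a distance-sensitive shaping term; the magnitudes are assumed values.

```python
def pursuit_reward(captured, dist_now, dist_prev,
                   capture_bonus=10.0, shaping=0.1):
    """Sketch of the two-part reward: sparse capture bonus + distance-sensitive shaping.
    `captured` is True only for the pursuer that actually catches an evader;
    `dist_now`/`dist_prev` are its distances to the closest evader at t and t-1."""
    reward = 0.0
    if captured:
        reward += capture_bonus                 # only the capturing pursuer receives this
    if dist_now < dist_prev:
        reward += shaping                       # moved closer to the nearest evader
    elif dist_now > dist_prev:
        reward -= shaping                       # moved away: penalized
    return reward
```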
Each agent is updated by gradient descent in distributed training. Specifically, each pursuing vehicle minimizes its own temporal-difference loss built on the GMAN-enhanced Bellman equation using its own collected experience.
4.3. Decision-making and training process of DGMARL-MVP
This part presents the overall decision-making and online training process of DGMARL-MVP, as shown in Algorithm 2. At the beginning of each episode, the urban pursuit-evasion environment and the local state of all agents are initialized. Then, the road information and the position information of vehicles are fed into the intersecting graph network, which outputs the coupled cognition features. Based on these features, each pursuing vehicle selects its action, interacts with the environment, and is trained online with the GMAN boosting RL described in Section 3.
Algorithm 2: DGMARL-MVP Decision-making and Online Training Algorithm
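As the body of Algorithm 2 is likewise not reproduced, the loop below gives a rough sketch of the described episode flow (shared observations, intersecting-cognition features, distributed decisions, and online updates); every interface shown, such as env.get_road_state and trainer.update, is an assumption.

```python
def run_dgmarl_mvp_episode(env, cognition, agents, trainer):
    """High-level sketch of the Algorithm 2 episode flow under assumed interfaces."""
    obs = env.reset()                                    # local observations of all pursuers
    done = False
    while not done:
        road_state = env.get_road_state()                # lane-level traffic information
        positions = env.get_vehicle_positions()          # shared by all pursuing vehicles
        features = cognition(road_state, positions, obs) # GNN-based intersecting cognition
        actions = [agent.act(f) for agent, f in zip(agents, features)]
        next_obs, rewards, done = env.step(actions)
        trainer.store(obs, actions, rewards, next_obs, done)
        trainer.update(agents)                           # distributed GMAN-boosted RL update
        obs = next_obs
```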
In the decision-making process of DGMARL-MVP, collaboration among agents is realized through information sharing. During the pursuit, every pursuing vehicle shares its own position and observation information with the other pursuing vehicles. The shared information is fed into the GNN-based intersecting cognition to gain effective and comprehensive awareness of the agent relationships and traffic situations.
5. EXPERIMENTS AND RESULTS
5.1. Simulator and parameter settings
As a MARL algorithm, DGMARL-MVP collects training data and updates parameters by interacting with the simulated urban traffic environment. This paper constructs a complex urban traffic environment based on SUMO[29] to verify the effect of DGMARL-MVP. The environment contains 16 junctions and 48 lanes, and 200 background vehicles travel on the road network according to the traffic rules. The main simulation settings, the hyperparameters, and the network structures of the deep Q network and the discriminator are listed in the following tables.
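For readers unfamiliar with SUMO, interaction with the simulator typically goes through the TraCI API; the snippet below is a generic sketch of such a loop, and the configuration file name is a placeholder rather than the authors' file.

```python
import traci

# Launch SUMO with a (placeholder) configuration of the 16-junction road network.
traci.start(["sumo", "-c", "urban_pursuit.sumocfg"])

for step in range(800):                            # maximum time steps per episode (see settings)
    traci.simulationStep()                         # advance the simulation by one step
    for veh_id in traci.vehicle.getIDList():
        pos = traci.vehicle.getPosition(veh_id)    # (x, y) position, shared among pursuers
        lane = traci.vehicle.getLaneID(veh_id)     # lane occupancy feeds the road graph
        # ... feed positions/lanes into the GNN-based cognition and RL decision-making ...

traci.close()
```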
Simulation settings
Parameters | Values
Maximum time steps | 800
Maximum speed (m/s) | 20
Maximum acceleration (m/s²) | 0.5
Maximum deceleration (m/s²) | 4.5
Number of lanes | 48
Length of location code | 7
Number of junctions | 16
Length of each lane (m) | 500
Number of background vehicles | 200
Parameter settings
Parameters | Values | Parameters | Values |
500 | |||
0.9 | 0 | ||
0.5 | 5 | ||
0.05 | 2600 |
Structure of the deep Q network
Layers | Deep Q network |
Input | (batch size, |
Dense Layer 1 | ( |
Activation Function | |
Dense Layer 2 | (32, 48) |
Activation Function | |
Dense Layer 3 | (48, 32) |
Activation Function | |
Dense Layer 4 | (32, 16) |
Activation Function | |
Dense Layer 5 | (16, 3) |
Activation Function | |
Output | (batch size, 3) |
Structure of the discriminator
Layers | Discriminator |
Input | (batch size, 1) |
Dense Layer 1 | (1, 128) |
Activation Function | |
Dense Layer 2 | (128, 64) |
Activation Function | |
Dense Layer 3 | (64, 1) |
Output | (batch size, 1) |
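Read directly off the two structure tables above, the deep Q network and the discriminator could be written as the following PyTorch modules; the input state dimension and the unlisted activation functions (ReLU and a final Sigmoid here) are assumptions, since they are not fully specified in the tables.

```python
import torch.nn as nn

def build_deep_q_network(state_dim):
    """Layer sizes follow the deep-Q-network table (...-32-48-32-16-3); state_dim and ReLU are assumed."""
    return nn.Sequential(
        nn.Linear(state_dim, 32), nn.ReLU(),
        nn.Linear(32, 48), nn.ReLU(),
        nn.Linear(48, 32), nn.ReLU(),
        nn.Linear(32, 16), nn.ReLU(),
        nn.Linear(16, 3),                      # three actions: left, right, straight
    )

def build_discriminator():
    """Layer sizes follow the discriminator table (1-128-64-1); activations are assumed."""
    return nn.Sequential(
        nn.Linear(1, 128), nn.ReLU(),
        nn.Linear(128, 64), nn.ReLU(),
        nn.Linear(64, 1), nn.Sigmoid(),
    )
```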
5.2. Ablation experiments
Ablation experiments are conducted to further demonstrate the effectiveness of the proposed method and to examine the impact of GMAN boosting RL and GNN-based intersecting cognition in DGMARL-MVP. Specifically, method a is DQN equipped only with GNN-based intersecting cognition, and method b is GMARL without GNN-based intersecting cognition. The results are shown in Table 5.
Evaluation results
Scene (N pursuers-M evaders) | P4-E2 | | | P5-E3 | | | P7-E4 | |
Evaluation metrics | Average Reward | Average Time Steps | Success Rate | Average Reward | Average Time Steps | Success Rate | Average Reward | Average Time Steps | Success Rate
1 a: DQN equipped with GNN-based intersecting cognition; b: GMARL without GNN-based intersecting cognition. | |||||||||
DGMARL-MVP | 8.688 | 644.96 | 0.91 | 8.827 | 698.20 | 0.85 | 9.213 | 731.64 | 0.81 |
a | 7.407 | 695.33 | 0.86 | 8.592 | 728.53 | 0.81 | 8.791 | 751.32 | 0.78 |
b | 8.094 | 684.01 | 0.87 | 8.692 | 714.36 | 0.82 | 8.959 | 739.55 | 0.80 |
DQN | 6.953 | 736.09 | 0.81 | 7.195 | 763.31 | 0.72 | 8.122 | 749.76 | 0.76 |
PPO | 7.513 | 717.41 | 0.86 | 8.368 | 731.88 | 0.80 | 8.844 | 745.51 | 0.78 |
QMIX | 6.339 | 745.27 | 0.77 | 8.645 | 749.09 | 0.74 | 8.725 | 755.48 | 0.73 |
Compared with a, the average reward of DGMARL-MVP is increased by 17.29% in the P4-E2 scene, which demonstrates the benefit of the GMAN boosting RL.
In addition, it can be observed from Table 5 that the proposed DGMARL-MVP shows a higher average reward than b, exactly 7.38% higher, which demonstrates the contribution of the GNN-based intersecting cognition.
Furthermore, Figure 5 depicts the bar chart comparison of the three metrics (average reward, average time steps, and success rate) for DGMARL-MVP, a, and b in the P4-E2, P5-E3, and P7-E4 scenes, offering a more intuitive illustration of the effectiveness of the proposed modules. It is evident that DGMARL-MVP achieves the best performance on all metrics in every scene. The ablation experiments confirm that the proposed GMAN boosting RL algorithm can generate appropriate potential dense rewards, which makes RL more forward-looking in policy updating and correctly guides the optimization direction of the RL policy, thereby improving the stability of the distributed multi-agent system and enhancing the optimality of agent decision-making. Meanwhile, the proposed GNN-based intersecting cognition adequately couples the interaction features of agents with traffic information and enhances their ability to handle multi-source heterogeneous data, thereby promoting the adaptability of the agents to the dynamic environment and improving the pursuing efficiency.
5.3. Comparison with other methods
This part demonstrates the performance of applying DGMARL-MVP and other algorithms to three scenes of MVP problems. This paper uses DQN, QMIX[30], and PPO for comparison. The details are shown in Table 5.
In the MVP problem of P4-E2, the three metrics show consistency in the performance evaluation. It is clear that DGMARL-MVP is noticeably the strongest performer on all of the metrics, which indicates the superiority of the proposed DGMARL-MVP. The success rate reaches an appreciable 91%, and DGMARL-MVP also obtains the highest average reward and the fewest average time steps among all methods.
Upgrading the difficulty to P5-E3, the proposed DGMARL-MVP algorithm still shows superior performance over the comparison algorithms. This superiority is specifically manifested in that the average reward of DGMARL-MVP is 2.1% higher than that of QMIX, the strongest comparison method on this metric in this scene, while DGMARL-MVP also achieves the fewest average time steps and the highest success rate.
The difficulty setting of P7-E4 is approximately the same as that of P5-E3, but the increase of vehicles in both the pursuing team and the evading team raises the difficulty of global scheduling. However, as shown in Table 5, the proposed DGMARL-MVP still achieves the best performance among the three other algorithms. In terms of the average reward metric, DGMARL-MVP is 4.17% higher than PPO, the strongest comparison method in this scene.
By comparing the performance of all algorithms in these three scenes, the proposed DGMARL-MVP is the most stable algorithm and also performs the best. Despite this overall stability, there are still differences in the performance of DGMARL-MVP across the three scenes. As the difficulty of the scene increases, for example, from P4-E2 to P5-E3, the success rate of the pursuit decreases by 7.06%.
In order to show the performance variation and comparison of all algorithms more clearly, a bar chart is used to show the changing trend of the three metrics in the three scenes, as shown in Figure 6. Intuitively, as the number of evading vehicles increases, the average reward increases as well. In Figure 6A, unexpectedly, the proposed DGMARL-MVP presents the highest rewards but only a slight increase, while QMIX presents the largest increase. A possible reason is that DGMARL-MVP provides better decisions throughout the pursuit, resulting in a high reward accumulation from the start. As the difficulty increases in Figure 6B, DGMARL-MVP shows a larger increase than the other algorithms in average time steps, which illustrates that DGMARL-MVP holds a larger advantage in simple scenes than in difficult ones. In terms of the success rate in Figure 6C, DGMARL-MVP, like the other algorithms, shows a downward trend, although it remains the highest; the only exception is DQN, whose success rate rises as the scene changes from P5-E3 to P7-E4. The negative impact of increasing pursuing difficulty on success rates is indisputable, but the performance of DGMARL-MVP remains more stable, which shows the generalization ability of DGMARL-MVP.
5.4. Convergence comparison during training
In order to more convincingly demonstrate the advantages of the proposed method, Figure 7 depicts the convergence curves of average reward with training steps for DGMARL-MVP, a, PPO, and QMIX in the P4-E2, P5-E3, and P7-E4 scenes. In Figure 7A, for the P4-E2 scene, it can be seen that, compared with a, whose fluctuation is slightly smaller than that of the other methods in the last stage of training, DGMARL-MVP performs better in both convergence rate and convergence target. For the P5-E3 scene, as shown in Figure 7B, although all the methods show similar convergence stability at the last stage of training, the proposed method has a better convergence trend, with a growing curve and a higher convergence target during training. Figure 7C depicts the convergence curve of average reward with training steps in the P7-E4 scene, showing that DGMARL-MVP outperforms the other methods in both convergence rate and convergence trend. In conclusion, Figure 7 illustrates that, compared with PPO and QMIX, which are respectively the best and the worst of the comparison methods, DGMARL-MVP achieves a competitive convergence rate and trend, demonstrating its superiority and effectiveness on MVP under urban environments.
Figure 7. Convergence Process During Training. (A): Convergence Process of Average Reward for Methods in P4-E2. (B): Convergence Process of Average Reward for Methods in P5-E3. (C): Convergence Process of Average Reward for Methods in P7-E4.
A horizontal comparison of the convergence of the proposed DGMARL-MVP in the three scenes shows that its convergence performance is basically the same across scenes, with convergence starting at about 1950 time steps. Compared with the other three algorithms, in all three scenes with different difficulties, the convergence time step of DGMARL-MVP is basically the same as that of the other algorithms. It is worth mentioning that the proposed DGMARL-MVP improves remarkably quickly at the beginning of the training process, which indicates that the algorithm can better guide the direction of training at the initial stage. Accordingly, it is not surprising that the reward of DGMARL-MVP remains the highest from the beginning to the end of training in the P4-E2 and P7-E4 scenes. An exception occurs in the P5-E3 scene, where the average reward is surpassed by DQN in the later stages of training, but the final result of DGMARL-MVP is still the best. In addition, the proposed DGMARL-MVP exhibits much smaller fluctuations in this scene, which shows the stability of the proposed algorithm.
6. CONCLUSIONS
This paper has proposed DGMARL-MVP to address the sparse rewards and the insufficient perception of complex traffic situations in MVP under urban traffic environments. In DGMARL-MVP, a GMAN has been designed to generate potential dense rewards and provide proper guidance for distributed RL optimization. Equipped with the GMAN, DGMARL-MVP effectively solves the problem of optimization direction ambiguity caused by reward sparsity via the enhanced Bellman equation. In addition, this paper has proposed a GNN-based intersecting cognition, where the construction of the vehicle graph encourages a deep coupling between traffic information and multi-agent information. It thoroughly extracts and utilizes the multi-source heterogeneous data of urban traffic and the complicated multi-agent interaction features, thus considerably improving pursuing efficiency. Extensive experimental results have demonstrated that DGMARL-MVP can improve the average pursuing success rate up to 85.67% and reduce the pursuit time by 5.47% compared with PPO.
DECLARATIONS
Acknowledgments
The authors would like to thank the editor-in-chief, the associate editor, and the anonymous reviewers for their valuable comments.
Authors' contributions
Made contributions to the research, idea generation, conception, and design of the work and wrote and edited the original draft: Zhang L, Li X, Yang Y, Wang Q
Made contributions to the algorithm design and simulation and developed the majority of the associated code for the simulation environment and the proposed method: Li X, Yuan Z, Yang Y
Participated in part of the experimental data analysis and visualizations and performed data collation and related tasks: Li L, Xu C, Wang Q
Performed critical review and revision and provided administrative, technical, and material support: Zhang L, Li L, Xu C
Availability of data and materials
The codes of this paper are open-sourced and available at https://github.com/BUPT-ANTlab/DGMARL-MVP.
Financial support and sponsorship
This work is supported by the National Natural Science Foundation of China (Grants No. 61971096 and No. 62176024), the National Key R & D Program of China (2022ZD01161, 2022YFB2503202), Beijing Municipal Science & Technology Commission (Grant No. Z181100001018035) and Engineering Research Center of Information Networks, Ministry of Education.
Conflicts of interest
All authors declared that there are no conflicts of interest.
Ethical approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Copyright
© The Author(s) 2023.
REFERENCES
1. Chen C, Zou W, Xiang Z. Event-triggered consensus of multiple uncertain euler-lagrange systems with limited communication range. IEEE Trans Syst Man Cybern, Syst 2023;53:5945-54.
2. Boin C, Lei L, Yang SX. AVDDPG-Federated reinforcement learning applied to autonomous platoon control. Intell Robot 2022;2:145-67.
3. Zhu Z, Pivaro N, Gupta S, Gupta A, Canova M. Safe model-based off-policy reinforcement learning for eco-Driving in connected and automated hybrid electric vehicles. IEEE Trans Intell Veh 2022;7:387-98.
4. Cao Z, Xu S, Jiao X, Peng H, Yang D. Trustworthy safety improvement for autonomous driving using reinforcement learning. Trans Res Part C-Emer Technol 2022;138:103656.
5. Patrol guide. section: Tactical operations. procedure no: 221-15; 2016. Available from:
6. Qi Q, Zhang X, Guo X. A Deep Reinforcement Learning Approach for the Pursuit Evasion Game in the Presence of Obstacles. In: 2020 IEEE International Conference on Real-time Computing and Robotics (RCAR). IEEE; 2020. pp. 68–73.
7. Xu B, Wang Y, Wang Z, Jia H, Lu Z. Hierarchically and cooperatively learning traffic signal control. In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 35; 2021. pp. 669–77.
8. Li S, Yan Z, Wu C. Learning to delegate for large-scale vehicle routing. Adv Neural Inf Process Syst 2021;34. Available from:
9. Garcia E, Casbeer DW, Von Moll A, Pachter M. Multiple pursuer multiple evader differential games. IEEE Trans Automat Contr 2020;66:2345-50.
10. Xu Y, Yang H, Jiang B, Polycarpou MM. Multiplayer pursuit-evasion differential games with malicious pursuers. IEEE Trans Automat Contr 2022;67:4939-46.
11. Lopez VG, Lewis FL, Wan Y, Sanchez EN, Fan L. Solutions for multiagent pursuit-evasion games on communication graphs: finite-time capture and asymptotic behaviors. IEEE Trans Automat Contr 2020;65:1911-23.
12. Pan T, Yuan Y. A region-based relay pursuit scheme for a pursuit-evasion game with a single evader and multiple pursuers. IEEE Trans Syst Man Cybern, Syst 2023;53:1958-69.
13. Jia S, Wang X, Shen L. A continuous-time markov decision process-based method with application in a pursuit-evasion example. IEEE Trans Syst Man Cybern, Syst 2016;46:1215-25.
14. De Souza C, Newbury R, Cosgun A, et al. Decentralized multi-agent pursuit using deep reinforcement learning. IEEE Robot Autom Lett 2021;6:4552-59.
15. Zhang R, Zong Q, Zhang X, Dou L, Tian B. Game of drones: multi-uav pursuit-evasion game with online motion planning by deep reinforcement learning. IEEE Trans Neural Netw Learn Syst 2022; doi: 10.1109/TNNLS.2022.3146976.
16. Yang Y, Li X, Yuan Z, Wang Q, Xu C, et al. Graded-Q reinforcement learning with information-enhanced state encoder for hierarchical collaborative multi-vehicle pursuit. In: 2022 18th International Conference on Mobility, Sensing and Networking (MSN); 2022. pp. 534–41.
17. Zheng Z, Duan H. UAV maneuver decision-making via deep reinforcement learning for short-range air combat. Intell Robot 2023;3:76-94.
18. Durugkar I, Gemp I, Mahadevan S. Generative multi-adversarial networks. In: International Conference on Learning Representations (ICLR); 2017. Available from:
19. Wang Z, Zhu H, He M, et al. Gan and multi-agent drl based decentralized traffic light signal control. IEEE Trans Veh Technol 2021;71:1333-48.
20. Zhan H, Tao F, Cao Y. Human-guided robot behavior learning: a gan-assisted preference-based reinforcement learning approach. IEEE Robot Autom Lett 2021;6:3545-52.
21. Li L, Yao J, Wenliang L, et al. Grin: Generative relation and intention network for multi-agent trajectory prediction. Adv Neural Inf Process Syst 2021;34:27107-18.
22. Xia Y, Zhou J, Shi Z, Lu C, Huang H. Generative adversarial regularized mutual information policy gradient framework for automatic diagnosis. In: Proceedings of the AAAI conference on artificial intelligence. vol. 34; 2020. pp. 1062–69.
23. Zheng C, Yang S, Parra-Ullauri JM, Garcia-Dominguez A, Bencomo N. Reward-reinforced generative adversarial networks for multi-agent systems. IEEE Trans Emerg Top Comput Intell 2021;6:479-88.
24. Liu Y, Wang W, Hu Y, et al. Multi-agent game abstraction via graph attention neural network. In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 34; 2020. pp. 7211–18.
25. Du W, Ding S, Zhang C, Shi Z. Multiagent Reinforcement Learning With Heterogeneous Graph Attention Network. IEEE Trans Neural Netw Learn Syst 2022;PP:1-10.
26. Liu Q, Li Z, Li X, Wu J, Yuan S. Graph convolution-based deep reinforcement learning for multi-agent decision-making in interactive traffic scenarios. In: 2022 IEEE 25th International Conference on Intelligent Transportation Systems (ITSC). IEEE; 2022. pp. 4074–81.
27. Xiaoqiang M, Fan Y, Xueyuan L, et al. Graph Convolution Reinforcement Learning for Decision-Making in Highway Overtaking Scenario. In: 2022 IEEE 17th Conference on Industrial Electronics and Applications (ICIEA). IEEE; 2022. pp. 417–22.
28. Chen Y, Shu T, Zhou X, et al. Graph attention network with spatial-temporal clustering for traffic flow forecasting in intelligent transportation system. IEEE Trans Intell Transport Syst 2022; doi: 10.1109/TITS.2022.3208952.
29. Lopez PA, Behrisch M, Bieker-Walz L, et al. Microscopic Traffic Simulation using SUMO. In: The 21st IEEE International Conference on Intelligent Transportation Systems. IEEE; 2018.
30. Rashid T, Samvelyan M, Schroeder de Witt C, Farquhar G, Foerster J, Whiteson S. QMIX: monotonic value function factorisation for deep multi-agent reinforcement learning. In: Proceedings of the 35th International Conference on Machine Learning (ICML). PMLR; 2018.