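Abstract
In emergency rescue, UAVs can effectively help search-and-rescue personnel shorten search time and reduce loss of life and property. To address the long running time of traditional methods as the scale of the UAV search grows, an attention-based multi-trajectory policy optimization method is proposed. The method builds on a deep reinforcement learning algorithm: it introduces multi-trajectory sampling to avoid sampling bias in the trajectory data, introduces data augmentation to further enrich the features of the trajectory data, and proposes an objective loss function that incorporates information entropy to guide the model toward a better feasible solution space. The network model uses an encoder-decoder framework; the encoder's learning ability is improved by adjusting the attention mechanism network, and the decoder's generalization ability is improved by adding a residual sublayer. Random datasets and public datasets are used to verify the model's optimization performance and generalization performance, respectively. The experimental results show that, compared with computing the theoretical optimal solution, the proposed method shortens computation time by 94.3% on average. In addition, compared with the solutions of the baseline method, it achieves a relative improvement of 95.1% in average gap and a 4.4% reduction in standard deviation, providing a technical reference for improving the efficiency of UAV rescue search.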
In emergency rescue, Unmanned Aerial Vehicles (UAVs) can effectively help staff shorten the search time and reduce the loss of life and property caused by disasters. The Traveling Salesman Problem (TSP) can be used to plan UAV search paths, but as the number of search targets increases, traditional methods suffer from long running times. To address this problem, we propose an Attention-based Multi-Trajectory Policy Optimization (AMTPO) method. Building on deep reinforcement learning, we use a multi-trajectory sampling technique to avoid sampling bias in the trajectory data, which prevents identical loop trajectories that start from different initial points from being omitted. To further enrich the features of the trajectory data, we devise an instance augmentation strategy that improves the network model's generalization performance. An entropy loss term is added to the objective loss function to improve exploration during training, guiding the model toward a better feasible solution space. The network model adopts the encoder-decoder framework: the encoder's learning capability is enhanced by adjusting the attention mechanism network, and the decoder's generalization capability is improved by adding a residual sublayer. In the comparison experiments, we verified the optimization performance of AMTPO on three types of random instances. In the generalization experiments, we validated the generalization performance of the model on TSPLIB instances selected according to the distribution characteristics of several rescue search regions. In the ablation experiments, we tested the learning performance of different structural models, evaluated each innovation of AMTPO, and selected the best-performing network model as the final model used for the different test instances. The experimental results show that AMTPO has a calculation time that is on average about 94.3% shorter than Concorde's, an average gap that is about 95.1% smaller than POMO's in relative terms, and a standard deviation of path le
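The multi-trajectory sampling and entropy term described in the abstract can be illustrated with a minimal sketch. Everything in it is an assumption for illustration rather than the authors' implementation: the stand-in policy is a softmax over negative distances (the paper uses a learned attention network), the baseline is the mean tour length over all start nodes, and the names `sample_trajectory`, `multi_trajectory_loss`, and `entropy_coef` are hypothetical. The returned loss is only a scalar surrogate; in a real training loop its gradient with respect to the policy parameters would be taken by an autodiff framework.

```python
import math
import random


def tour_length(coords, tour):
    """Length of the closed tour that visits `coords` in the order `tour`."""
    n = len(tour)
    return sum(math.dist(coords[tour[i]], coords[tour[(i + 1) % n]]) for i in range(n))


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def sample_trajectory(coords, start, rng):
    """Sample one tour from `start`; return (tour, log-probability, summed entropy).

    Assumption: a hand-written softmax over negative distances stands in
    for the learned attention-based policy.
    """
    tour = [start]
    unvisited = [i for i in range(len(coords)) if i != start]
    log_prob, entropy = 0.0, 0.0
    while unvisited:
        cur = tour[-1]
        probs = softmax([-math.dist(coords[cur], coords[j]) for j in unvisited])
        entropy -= sum(p * math.log(p) for p in probs if p > 0)
        k = rng.choices(range(len(unvisited)), weights=probs)[0]
        log_prob += math.log(probs[k])
        tour.append(unvisited.pop(k))
    return tour, log_prob, entropy


def multi_trajectory_loss(coords, rng, entropy_coef=0.01):
    """Surrogate loss over one trajectory per start node, plus the best length.

    Assumptions: the shared baseline is the mean tour length, and the
    entropy bonus is subtracted with a hypothetical weight `entropy_coef`.
    """
    n = len(coords)
    rollouts = [sample_trajectory(coords, s, rng) for s in range(n)]
    lengths = [tour_length(coords, tour) for tour, _, _ in rollouts]
    baseline = sum(lengths) / n
    policy_term = sum(
        (length - baseline) * lp for length, (_, lp, _) in zip(lengths, rollouts)
    ) / n
    entropy_term = sum(h for _, _, h in rollouts) / n
    return policy_term - entropy_coef * entropy_term, min(lengths)


if __name__ == "__main__":
    rng = random.Random(0)
    points = [(rng.random(), rng.random()) for _ in range(8)]
    loss, best = multi_trajectory_loss(points, rng)
    print(f"surrogate loss: {loss:.4f}, best sampled tour length: {best:.4f}")
```

Starting one rollout from every node reflects the observation that a closed tour has the same length whichever node it starts from, so a sampler tied to a single start node would see only one of these equivalent trajectories.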
Authors
王鹏
王小清
吴仁彪
WANG Peng; WANG Xiaoqing; WU Renbiao (Tianjin Key Laboratory of Intelligent Signal and Image Processing, Civil Aviation University of China, Tianjin 300300, China)
Source
《安全与环境学报》
CAS
CSCD
Peking University Core Journals
2023, Issue 12, pp. 4381-4391 (11 pages)
Journal of Safety and Environment
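Funding
Civil Aviation Joint Research Fund of the National Natural Science Foundation of China and the Civil Aviation Administration of China (U2133204)
National Natural Science Foundation of China (62141108)
Civil Aviation University of China Supporting Special Fund for the National Natural Science Foundation (3122022PT01).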
Keywords
public safety
rescue search
traveling salesman problem
deep reinforcement learning
attention mechanism
multi-trajectory sampling
instance augmentation