Abstract
In highly adversarial scenarios such as military operations and counter-terrorism assaults, real-time information is fragmented and uncertain, which raises the bar for producing flexible action plans that hold a game-theoretic advantage. Intelligent action policy planning with self-learning ability has therefore become a core problem for formation-level adversarial tasks. To address the difficulty of state representation and the low data efficiency of policy planning in complex scenarios, a sample-adaptive action policy planning method based on predictive coding is proposed. An auto-encoder compresses the original state space of the task. Using state transition samples from the task environment, an autoregressive model with a mixture density network (MDN) then learns the environment's dynamics in the low-dimensional state space, yielding a predictive code that captures those dynamics and improves the state representation. Policy planning is carried out on top of this predictive code: a sample-adaptive method that is sensitive to the temporal-difference (TD) error is used to predict the state-value function, which improves data efficiency and accelerates convergence. To verify the algorithm, a typical air combat scenario is constructed from the machine-versus-machine challenge of the National Wargame Competition, together with five rule-based agents that incorporate the strategies of prize-winning contestants. Ablation experiments examine how different combinations of factors, such as the encoding scheme and the sampling strategy, affect the algorithm, and the Elo rating mechanism is used to rank the agents. The results show that MDN-AF, the sample-adaptive algorithm based on predictive coding, ranks highest, with an average win rate of 71%, of which 67.6% are wins by a large margin. It also learned four interpretable long-horizon strategies: autonomous wave division, supplementary reconnaissance, "snake" strikes, and rear-positioned bomber raids. The algorithm framework was used to develop an agent for the 2020 National Wargame Competition, which won a national first prize.
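The Elo ranking mentioned in the abstract can be illustrated with a minimal sketch of the standard Elo update. The abstract does not state the K-factor, the initial rating, or how draws are scored, so `k=32.0`, `initial=1500.0`, and the agent names used with this code are assumptions for illustration only, not the paper's settings.

```python
from typing import Dict, Iterable, List, Tuple


def expected_score(rating_a: float, rating_b: float) -> float:
    """Standard Elo expected score of A against B (base 10, scale 400)."""
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / 400.0))


def update_elo(rating_a: float, rating_b: float, score_a: float,
               k: float = 32.0) -> Tuple[float, float]:
    """Return the updated (rating_a, rating_b) after one game.

    score_a is 1.0 for a win by A, 0.5 for a draw, 0.0 for a loss.
    k is an assumed K-factor; the source does not give one.
    """
    delta = k * (score_a - expected_score(rating_a, rating_b))
    # The update is zero-sum: what A gains, B loses.
    return rating_a + delta, rating_b - delta


def rank_agents(results: Iterable[Tuple[str, str, float]],
                initial: float = 1500.0,
                k: float = 32.0) -> List[Tuple[str, float]]:
    """Process (agent_a, agent_b, score_a) games in order and
    return agents sorted by final rating, highest first."""
    ratings: Dict[str, float] = {}
    for a, b, score_a in results:
        ra = ratings.setdefault(a, initial)
        rb = ratings.setdefault(b, initial)
        ratings[a], ratings[b] = update_elo(ra, rb, score_a, k)
    return sorted(ratings.items(), key=lambda kv: kv[1], reverse=True)
```

Calling `rank_agents` with a list of game records such as `("agent_a", "agent_b", 1.0)` (hypothetical names) returns the agents ordered by rating. Because each update is zero-sum, the total rating across agents stays constant, which makes rankings from different experiment runs comparable as long as the same initial rating is used.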
Authors
梁星星
马扬
冯旸赫
张驭龙
张龙飞
廖世江
刘忠
LIANG Xing-Xing; MA Yang; FENG Yang-He; ZHANG Yu-Long; ZHANG Long-Fei; LIAO Shi-Jiang; LIU Zhong (College of Systems Engineering, National University of Defense Technology, Changsha 410072, China; 31002 Troops)
Source
Journal of Software (《软件学报》)
EI
CSCD
PKU Core Journals (北大核心)
2022, No. 4, pp. 1477-1500 (24 pages)
Funding
National Natural Science Foundation of China (71701205).
Keywords
action planning
reinforcement learning
wargame
predictive coding
sample adaptive