Abstract
In two-player zero-sum Markov games, the traditional policy gradient theorem only applies to alternating training of the two players, because each player's policy is affected by the other player's policy. To train both players simultaneously, a policy gradient theorem for two-player zero-sum Markov games is established. Based on this theorem, an extra-gradient based REINFORCE algorithm is proposed, under which the joint policy of the two players converges to an approximate Nash equilibrium. The superiority of the proposed algorithm is analyzed from multiple perspectives. Firstly, comparative experiments on simultaneous-move games show that the proposed algorithm achieves better convergence and faster convergence speed. Secondly, the characteristics of the joint policies obtained by the proposed algorithm are analyzed, and these joint policies are verified to reach an approximate Nash equilibrium. Finally, comparative experiments on simultaneous-move games with different difficulty levels show that the proposed algorithm maintains a good convergence speed at higher difficulty levels.
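The abstract describes simultaneously training both players with an extra-gradient style REINFORCE update. The sketch below is a minimal, hypothetical illustration of that idea, not the paper's algorithm or experimental setting: a one-shot zero-sum matrix game (matching pennies) stands in for the Markov game, and the payoff matrix, step size, sample count, and iteration budget are all illustrative assumptions. Each player uses a softmax policy, REINFORCE provides a Monte Carlo gradient estimate, and the extra-gradient rule first takes an extrapolation step and then updates with the gradient evaluated at the extrapolated point, so the simultaneously trained joint policy can approach an approximate Nash equilibrium.

```python
# Hypothetical sketch: extra-gradient REINFORCE on a zero-sum matrix game
# (matching pennies). All constants below are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
A = np.array([[1.0, -1.0],
              [-1.0, 1.0]])        # payoff to player 1; player 2 receives -A

def softmax(z):
    z = z - z.max()                # numerically stable softmax
    e = np.exp(z)
    return e / e.sum()

def reinforce_grads(theta1, theta2, n_samples=512):
    """Monte Carlo REINFORCE estimates of both players' policy gradients."""
    p1, p2 = softmax(theta1), softmax(theta2)
    g1 = np.zeros_like(theta1)
    g2 = np.zeros_like(theta2)
    for _ in range(n_samples):
        a1 = rng.choice(2, p=p1)
        a2 = rng.choice(2, p=p2)
        r = A[a1, a2]                          # player 1's reward; player 2 gets -r
        g1 += (np.eye(2)[a1] - p1) * r         # grad of log softmax: one-hot(a) - p
        g2 += (np.eye(2)[a2] - p2) * (-r)
    return g1 / n_samples, g2 / n_samples

theta1 = np.zeros(2)
theta2 = np.zeros(2)
eta = 0.5
for _ in range(200):
    # extrapolation (look-ahead) step with gradients at the current point
    g1, g2 = reinforce_grads(theta1, theta2)
    mid1, mid2 = theta1 + eta * g1, theta2 + eta * g2
    # update step: apply gradients evaluated at the extrapolated point
    g1, g2 = reinforce_grads(mid1, mid2)
    theta1 += eta * g1
    theta2 += eta * g2

print("player 1 policy:", softmax(theta1))   # should stay near (0.5, 0.5), the Nash equilibrium
print("player 2 policy:", softmax(theta2))
```

On this bilinear game, plain simultaneous gradient ascent tends to cycle around the equilibrium, whereas the extrapolation step of the extra-gradient update keeps the stochastic iterates close to the mixed Nash strategy (0.5, 0.5) for both players.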
Authors
李永强
周键
冯宇
冯远静
LI Yongqiang; ZHOU Jian; FENG Yu; FENG Yuanjing (College of Information Engineering, Zhejiang University of Technology, Hangzhou 310023)
Source
《模式识别与人工智能》
EI
CSCD
Peking University Core Journal
2023, No. 1, pp. 81-91 (11 pages)
Pattern Recognition and Artificial Intelligence
Funding
Supported by the General Program of the National Natural Science Foundation of China (No. 62073294)
and the Key Program of the Zhejiang Provincial Natural Science Foundation (No. LZ21F030003).