Abstract
An approximate Nash equilibrium policy optimization algorithm that simultaneously updates the policies of both players was proposed to address the low learning efficiency of policy-based reinforcement learning methods in two-player zero-sum Markov games. The two-player zero-sum Markov game was formulated as a max-min optimization problem. For parameterized policies, the policy gradient theorem of the Markov game was given, and the derivation of an approximate stochastic policy gradient provided a feasible basis for implementing the algorithm. Gradient update methods for the max-min problem were compared and analyzed, and the extragradient method was found to have better convergence performance than the alternatives. Based on this finding, an approximate Nash equilibrium policy optimization algorithm built on the extragradient was proposed, and a proof of the algorithm's convergence was given. On the Oshi-Zumo game, a tabular softmax parameterization and a neural network were used as the parameterized policy to verify the effectiveness of the algorithm at different game scales. Comparative experiments verified the convergence of the algorithm and its superiority over other methods.
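The pivotal claim above is that the extragradient update outperforms plain simultaneous gradient updates on max-min problems of the form min_θ max_φ J(θ, φ): each player first takes an extrapolation (look-ahead) step, then updates using the gradient evaluated at the extrapolated point. As a minimal illustration of that mechanism (a toy sketch, not the paper's algorithm: the bilinear objective f(x, y) = x·y, the step size eta, and all variable names are assumptions made here for demonstration), the following Python snippet contrasts simultaneous gradient descent-ascent, whose iterates spiral away from the saddle point, with the extragradient method, whose iterates converge to it:

```python
import math

# Toy bilinear saddle-point problem: min over x, max over y of f(x, y) = x * y.
# Unique Nash equilibrium: (x, y) = (0, 0). On this problem, plain simultaneous
# gradient descent-ascent (GDA) spirals away from the equilibrium, while the
# extragradient method converges to it.

def grad(x, y):
    """Return (df/dx, df/dy) for f(x, y) = x * y."""
    return y, x

eta = 0.1                      # step size (illustrative choice)
x_gda, y_gda = 1.0, 1.0        # GDA iterate
x_eg, y_eg = 1.0, 1.0          # extragradient iterate

for _ in range(200):
    # Simultaneous GDA: each player steps along its own gradient.
    gx, gy = grad(x_gda, y_gda)
    x_gda, y_gda = x_gda - eta * gx, y_gda + eta * gy

    # Extragradient: first an extrapolation (look-ahead) step ...
    gx, gy = grad(x_eg, y_eg)
    x_half, y_half = x_eg - eta * gx, y_eg + eta * gy
    # ... then the actual update uses the gradient at the look-ahead point.
    gx, gy = grad(x_half, y_half)
    x_eg, y_eg = x_eg - eta * gx, y_eg + eta * gy

print(f"GDA distance to equilibrium:           {math.hypot(x_gda, y_gda):.3f}")
print(f"extragradient distance to equilibrium: {math.hypot(x_eg, y_eg):.3f}")
```

The look-ahead gradient damps the rotation that makes simultaneous updates cycle or diverge on saddle-point problems; the abstract reports that the same mechanism, applied to parameterized policies, yields provable convergence toward an approximate Nash equilibrium in the Markov-game setting.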
Authors
WANG Zhuo; LI Yongqiang; FENG Yu; FENG Yuanjing (School of Information Engineering, Zhejiang University of Technology, Hangzhou 310000, China)
Source
Journal of Zhejiang University: Engineering Science
EI
CAS
CSCD
PKU Core
2024, No. 3, pp. 480-491 (12 pages)
Funding
National Natural Science Foundation of China (62073294)
Natural Science Foundation of Zhejiang Province (LZ21F030003)
Keywords
two-player zero-sum Markov game
reinforcement learning
policy optimization
extragradient
Nash equilibrium
neural network