Online Counterfactual Regret Minimization in Repeated Imperfect Information Extensive Games
(Original Chinese title: 不完美信息扩展式博弈中在线虚拟遗憾最小化)


Abstract: This paper studies how to exploit the weaknesses of suboptimal opponents in imperfect information extensive games. Most previous work relies on opponent modeling: it builds a model of the opponent's strategy and computes a best response to that model. A drawback of this approach is that the computed best response may not be a true best response, because the modeled strategy can differ from the strategy the opponent actually plays. We instead approach the problem from the perspective of online regret minimization, which avoids opponent modeling, and extend counterfactual regret minimization (CFR), a state-of-the-art offline equilibrium-computing algorithm, to online play. The core problem is how to compute counterfactual values in the online setting. We propose learning approximations of the counterfactual value of each information set from the results produced by the game, and introduce two estimators: a static estimator, which learns the values directly from the distribution of results and gives every result equal weight, and a dynamic estimator, which assigns larger weights to newly sampled results than to older ones so that it adapts quickly to changes in the opponent's strategy. Based on the two estimators, we propose two algorithms for online regret minimization. We also give the conditions under which the estimated values equal the true values, which shows the relationship between CFR and our algorithms. In experiments on one-card poker, we compare our algorithms with four online learning algorithms (DBBR, MCCFR-os, Q-learning, and Sarsa). The results show that our algorithms not only exploit weak opponents best, but also achieve the highest win rate against the four comparison algorithms in matches of only a few hands.
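The difference between the two estimators described in the abstract can be illustrated with a small sketch. The abstract does not give the paper's formulas, so everything below is an assumption made for illustration: the static estimator is modeled as a plain sample mean, the dynamic estimator as an exponentially weighted average with a hypothetical step size `alpha`, and `regret_matching` is the standard rule CFR uses to turn cumulative regrets into a strategy. The function names are hypothetical and do not come from the paper.

```python
from typing import List


def static_estimate(payoffs: List[float]) -> float:
    """Equal-weight estimate: the sample mean of all observed payoffs.

    Assumption: stands in for the paper's static estimator.
    """
    if not payoffs:
        return 0.0
    return sum(payoffs) / len(payoffs)


def dynamic_estimate(payoffs: List[float], alpha: float = 0.5) -> float:
    """Recency-weighted estimate: newer payoffs count more than older ones.

    Assumption: an exponentially weighted average with step size `alpha`
    stands in for the paper's dynamic estimator.
    """
    estimate = 0.0
    for i, payoff in enumerate(payoffs):
        if i == 0:
            estimate = payoff  # initialise with the first observation
        else:
            estimate += alpha * (payoff - estimate)  # move toward the new payoff
    return estimate


def regret_matching(cumulative_regrets: List[float]) -> List[float]:
    """Standard regret matching: play actions in proportion to positive regret."""
    positives = [max(r, 0.0) for r in cumulative_regrets]
    total = sum(positives)
    if total <= 0.0:
        # No action has positive regret: fall back to the uniform strategy.
        n = len(cumulative_regrets)
        return [1.0 / n] * n
    return [p / total for p in positives]


# An opponent whose behaviour changes halfway: three wins, then three losses.
observed = [1.0, 1.0, 1.0, -1.0, -1.0, -1.0]
static_value = static_estimate(observed)    # treats all six results equally
dynamic_value = dynamic_estimate(observed)  # leans toward the recent losses
strategy = regret_matching([2.0, -1.0, 2.0])
```

With this input the equal-weight estimate stays at the overall average, while the recency-weighted estimate moves toward the recent losses, which is the adaptation behavior the abstract attributes to the dynamic estimator.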

Authors: Hu Yujing (胡裕靖), Gao Yang (高阳), An Bo (安波)

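Affiliations: State Key Laboratory for Novel Software Technology (Nanjing University); Key Laboratory of Intelligent Information Processing, Institute of Computing Technology, Chinese Academy of Sciences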

Source: Journal of Computer Research and Development (《计算机研究与发展》), 2014, No. 10, pp. 2160-2170 (11 pages). Indexed in EI, CSCD, and the Peking University Core Journals list.

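Funding: National Natural Science Foundation of China (61035003, 61175042, 61321491, 61202212); Key Program of the Natural Science Foundation of Jiangsu Province (BK2011005); Jiangsu Province Graduate Research and Innovation Program for Regular Universities (CXLX13_049)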

Keywords: extensive games; imperfect information; regret minimization; counterfactual regret minimization; static estimator; dynamic estimator

Classification code: TP183 [Automation and Computer Technology: Control Theory and Control Engineering]



