期刊文献+

不完美信息扩展式博弈中在线虚拟遗憾最小化 被引量:8

Online Counterfactual Regret Minimization in Repeated Imperfect Information Extensive Games
下载PDF
导出
摘要 研究在不完美信息扩展式博弈中对次优对手弱点的利用.针对该领域中一种常用方法——对手建模方法——的不足,提出了从遗憾最小化的角度来利用次优对手弱点的思想,并基于一种离线的均衡计算方法——虚拟遗憾最小化方法——将其扩展到在线博弈的场景中,实现对次优对手弱点的利用.提出了从博弈结果中估计各个信息集的虚拟价值的方法,给出2种估计手段:静态估计法和动态估计法.静态估计法直接从博弈结果的分布中进行估计,并对每个结果给以相等的估计权重;而动态估计法则对新产生的博弈结果给以较高的估计权重,以便快速地适应对手的策略变化.基于2种估计方法,提出在线博弈中虚拟遗憾最小化的算法,并在基于单牌扑克的实验中,与4种在线学习算法(DBBR,MCCFR-os,Q-learning,Sarsa)进行了对比.实验结果显示所提出的算法不仅对较弱对手的利用效果最好,还能在与4种对比算法的比赛中取得最高的胜率. In this paper, we consider the problem of exploiting suboptimal opponents in imperfect information extensive games. Most previous works use opponent modeling and find a best response to exploit the opponent. However, a potential drawback of such approach is that the best response may not be a real one, since the modeled strategy actually may not be the same as what the opponent plays. We try to solve this problem from the perspective of online regret minimization, which avoids opponent modeling. We make extensions to a state-of-the-art equilibrium-computing algorithm called counterfactual regret minimization (CFR). The core problem is how to compute the counterfactual values in online scenarios. We propose to learn approximations of these values from the results produced by the game and introduce two different estimators: static estimator which learns the values directly from the results' distribution, and dynamic estimator which assigns larger weight to new sampled results than older ones for better adapting to dynamic opponents. Two algorithms for online regret minimization are proposed based on the two estimators. We also give the conditions under which the values estimated by our estimators are equal to the true values, showing the relationship between CFR and our algorithms. Experimental results in one-card poker show that our algorithms not only perform the best when exploiting some weak opponents, but also outperform some state-of- the-art algorithms by achieving the highest win rate in matches with a few hands.
出处 《计算机研究与发展》 EI CSCD 北大核心 2014年第10期2160-2170,共11页 Journal of Computer Research and Development
基金 国家自然科学基金项目(61035003 61175042 61321491 61202212) 江苏省自然科学基金重点项目(BK2011005) 江苏省普通高校研究生科研创新计划基金项目(CXLX13_049)
关键词 扩展式博弈 不完美信息 遗憾最小化 虚拟遗憾最小化 静态估计法 动态估计法 extensive games minimization static estimator imperfect information regret minimization counterfactual regret dynamic estimator
  • 相关文献

参考文献15

  • 1Osborne M, Rubinstein A. A Course in Game Theory [M]. Cambridge, MA: MIT Press, 1994: 200-201. 被引量:1
  • 2Billings D, Burch N, Davidson A, et al. Approximating game-theoretic optimal strategies for full-scale poker [C] // Proc of the 18th Int Joint Conf on Artificial Intelligence. Mahwah, NJ: Lawrence Erlbaum Associates, 2003: 661-668. 被引量:1
  • 3Hoda S, Gilpin A, Pena J, et al. Smoothing techniques for computing nash equilibria of sequential games [J]. Mathematics of Operations Research, 2010, 35(2): 494-512. 被引量:1
  • 4Zinkevich M, Johanson M, Bowling M, et al. Regret minimization in games with incomplete information [C] // Proc of the 21st Annual Conf on Neural Information Processing Systems. Vancouver, CA: Curran Associates Inc., 2007: 1729-1736. 被引量:1
  • 5Gibson R, Lanctot M, Burch N, et al. Generalized sampling and variance in counterfactual regret minimization [C] //Proc of the 26th AAAI Conf on Artificial Intelligence. Palo Alto, CA: AAAI Press, 2012: 1355-1361. 被引量:1
  • 6Johanson M, Bard N, Lanctot M, et al. Efficient Nash equilibrium approximation through monte carlo counterfactual regret minimization [C] //Proc of the 11th Int Conf on Autonomous Agents and Multiagent Systems (AAMAS). Liverpool, UK: International Foundation of Autonomous Agents and Multi-Agent Systems (lFAAMAS) Press, 2012: 837-846. 被引量:1
  • 7Lanctot M, Waugh K, Zinkevich M, et al. Monte Carlo sampling for regret minimization in extensive games [C] // Proc of the 23rd Annual Conf on Neural Information Processing Systems. Vancouver, CA: Curran Associates Inc. ,2009: 1078-1086. 被引量:1
  • 8Johanson M, and Bowling M. Data biased robust counter strategies [C] I!Proc of the 12th Int Conf on Artificial Intelligence and Statistics (AIST A TS). Brookline, MA: Microtome Publishing, 2009: 264-271. 被引量:1
  • 9Ganzfried S, and Sandholm T. Game theory-based opponent modeling in large imperfect-information games [C] //Proc of the 10th lnt Conf on Autonomous Agents and Multi-Agent Systems (AAMAS). Liverpool, UK: International Foundation of Autonomous Agents and Multi-Agent Systems CIFAAMAS) Press, 2011: 533-540. 被引量:1
  • 10Sutton R, Barto A. Reinforcement Learning: An Introduction [M]. Cambridge, MA: MIT Press, 1998. 被引量:1

二级参考文献10

  • 1王伟东,朱清新.无线传感器网络中一种层次分簇算法及协作性分析(英文)[J].软件学报,2006,17(5):1157-1167. 被引量:21
  • 2KANNAN R,RAY L,KALIDINDI R,et al. Max-rain length-energyconstrained routing in wireless sensor networks[ C ]//Proc of the 1st European Conference on Wireless Sensor Networks. Berlin:Springer, 2004 : 234 -249. 被引量:1
  • 3KANNAN R, IYENGAR S. Game-theoretic models for reliable pathlength and energy-constrainod routing with data aggregation in wireless sensor networks[ J]. IEEE Trans on Selected Areas of Communications, 2004,22 ( 6 ) : 1141 - 1150. 被引量:1
  • 4MANJESHWAR A, AGARWAL D P. TEEN : a routing protocol for enhanced efficiency in wireless sensor networks[ C ]//Proc of the 1 st International Workshop on Parallel and Distributed Computing Issues in Wireless Networks and Mobile Computing. San Francisco: IEEE Computer Society,2001:2009-2015. 被引量:1
  • 5LI Cheng-fa, YE Mao, CHEN Gui-hai, et ol. An energy-efficient unequal clustering mechanism for wireless sensor networks [ C ]//Proc of the 2nd IEEE International Conference on Mobile Ad hoe and Sensor Systems( MASS 2005 ). Washington DC : [ s. n. ] ,2005:597-604. 被引量:1
  • 6REISFELD D, WOLFSON H, YESHURUN Y. Context free attentional operators: the generalized symmetry transforms [ J ]. International Journal of Computer Vision, 1995,14 (2) : 119-130. 被引量:1
  • 7REISFELD D, YESHURUN Y. Preproeessing of face images: detection of features and pose normalization [ J ]. Computer Vision and Image Understanding, 1998,71 ( 3 ) :413-430. 被引量:1
  • 8李慧芳,姜胜明,韦岗.无线传感器网络中基于博弈论的路由建模[J].传感技术学报,2007,20(9):2075-2079. 被引量:14
  • 9张志芳,刘木兰.理性密钥共享的扩展博弈模型[J].中国科学:信息科学,2012,42(1):32-46. 被引量:2
  • 10任丰原,黄海宁,林闯.无线传感器网络[J].软件学报,2003,14(7):1282-1291. 被引量:1709

共引文献7

同被引文献39

引证文献8

二级引证文献11

相关作者

内容加载中请稍等...

相关机构

内容加载中请稍等...

相关主题

内容加载中请稍等...

浏览历史

内容加载中请稍等...
;
使用帮助 返回顶部