摘要
蒙特卡洛树搜索(Monte Carlo tree search, MCTS)将强化学习的反馈优化与生长树的动态规划相结合,在输出当前状态的最佳动作的同时极大地减少了计算量,因此成为开放环境下众多领域智能系统的关键通用方法.但由于计算资源匮乏或者计算成本昂贵等原因,完全充分地对树结构进行搜索是难以实现的,因此在有限的预算下高效合理地分配计算资源从而获得当前状态下的最优动作是目前研究的一个重要问题.现有大多数算法仅以识别准确率作为性能指标,通过实验对比验证算法性能,缺少对算法的识别误差和影响因素的分析,从而降低了算法的可信性和可解释性.针对该问题,选择基础核心的2名玩家、完全信息、零和博弈场景,提出了固定预算设定下MCTS抽象模型的最优行动识别算法DLU——基于相对熵置信区间的纯探索(relative entropy confidence interval based pure exploration).首先提出了基于相对熵置信区间的估值方法对叶子节点胜率进行估计,其可以从底层提高树节点估值准确性;其次给出了第1层节点值估计、最优节点选择策略以形成完整算法流程;然后推导了DLU算法的识别误差上界,并分析了算法性能的影响因素;最后在人造树模型和井字棋2种场景下验证算法性能.实验结果表明,在人造树模型上基于相对熵的算法类具有更高的准确度,且模型越复杂识别难度越高时,该算法类的性能优势越显著.在井字棋场景下,DLU算法能有效地识别最优动作.
ion model with fixed budget setting.Firstly,a relative entropy-based confidence interval estimation method is constructed to estimate the win rate of leaf,which can essentially improve the valuation accuracy of tree nodes.Secondly,value estimation and node selection strategy for depth-one nodes are proposed to form the complete algorithm flow.Then the upper bound for output error of DLU algorithm is derived and the influencing factors on the algorithm performance are analysed.Finally,the algorithm performance is verified in two scenarios:artificial tree model and tic-tac-toe.The experimental results show that the algorithm class based on relative entropy has higher accuracy on the artificial tree model,and the performance advantage is more significant when the model is more complex and the recognition difficulty is higher.In the tic-tac-toe scenario,the DLU algorithm can effectively identify the best move.
作者
刘郭庆
钱宇华
张亚宇
王婕婷
Liu Guoqing;Qian Yuhua;Zhang Yayu;Wang Jieting(Institute of Big Data Science and Industry,Shanxi University,Taiyuan 030006;Key Laboratory of Computational Intelligence and Chinese Information Processing(Shanxi University),Ministry of Education,Taiyuan 030006;School of Computer and Information Technology,Shanxi University,Taiyuan 030006)
出处
《计算机研究与发展》
EI
CSCD
北大核心
2023年第8期1780-1794,共15页
Journal of Computer Research and Development
基金
国家自然科学基金重点项目(62136005)
国家重点研发计划项目(2021ZD0112400)。
关键词
蒙特卡洛树搜索
最优动作识别
多臂赌博机
误差最小化
强化学习
Monte Carlo tree search(MCTS)
best action identification
multi-armed bandit(MAB)
error minimization
reinforcement learning