Abstract
Machine learning has been attracting growing attention and is among the most active directions in artificial intelligence. The recent growth of reinforcement learning research is driven in part by agents that reach levels of play in video games that humans cannot match. Policy-based reinforcement learning algorithms can adapt well to a game environment, explore a relatively stable path, and pursue a globally optimal objective. This paper studies playing the game Flappy Bird with the reinforcement learning algorithm Q-learning. It first reviews the theoretical foundations of reinforcement learning, including Markov decision processes, dynamic programming, value function approximation, and temporal-difference learning. The focus is on building mathematical models of the states, actions, and rewards in Flappy Bird; to obtain the optimal policy, the objective in every state is to maximize the total expected reward. On this basis, a deep convolutional neural network is trained so that images of the game state can be recognized and classified. The system simulation successfully applies a deep Q-learning model to let Flappy Bird learn by itself: the exploration probability ε decreases linearly from 0.6 to 0 over 550,000 updates, the learning curve is steep at first and then levels off, convergence is reached in a relatively short time, and the training error is low. The trained agent achieves the desired performance, with an average score of 86 and a best score of 335, surpassing ordinary human players.
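For reference, a standard formulation of the objective the abstract describes: the optimal action-value function satisfies the Bellman optimality equation, and tabular Q-learning updates toward it with a temporal-difference step. The notation below is the conventional one; the paper's exact symbols are not given in this abstract.

    Q^*(s,a) = \mathbb{E}\!\left[\, r_{t+1} + \gamma \max_{a'} Q^*(s_{t+1}, a') \;\middle|\; s_t = s,\ a_t = a \,\right]

    Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[\, r_{t+1} + \gamma \max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t) \,\right]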
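The only training detail the abstract pins down numerically is the exploration schedule. Below is a minimal Python sketch of that ε-greedy linear annealing, assuming the usual two-action Flappy Bird setup (flap / do nothing); the constant and function names are illustrative, not taken from the paper.

    import random
    import numpy as np

    # Schedule stated in the abstract: epsilon decreases linearly from 0.6
    # to 0 over 550,000 updates, then stays at 0 (pure exploitation).
    EPS_START, EPS_END, ANNEAL_STEPS = 0.6, 0.0, 550_000

    def epsilon(step: int) -> float:
        """Exploration probability after `step` parameter updates."""
        frac = min(step / ANNEAL_STEPS, 1.0)
        return EPS_START + frac * (EPS_END - EPS_START)

    def select_action(q_values: np.ndarray, step: int) -> int:
        """Epsilon-greedy choice over the discrete actions (flap / no-op)."""
        if random.random() < epsilon(step):
            return random.randrange(len(q_values))  # explore: random action
        return int(np.argmax(q_values))             # exploit: greedy on Q estimates

    # Early in training the agent explores often; by 550,000 updates it is greedy.
    print(epsilon(0), epsilon(275_000), epsilon(550_000))  # 0.6 0.3 0.0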
Source
《计算机科学与应用》 (Computer Science and Application)
2021, No. 7, pp. 1994-2007 (14 pages)