
Maximum Entropy Advantage Actor-Critic Algorithm Based on Delay Strategy

Cited by: 1
Abstract: In reinforcement learning, an agent obtains reward by interacting with its environment, and higher rewards require a better policy. In high-dimensional, complex environments, however, traditional reinforcement learning algorithms suffer from high sample complexity and from overestimation, which causes large fluctuations in the process of computing the optimal policy and makes the algorithm difficult to converge. To address these problems, a maximum entropy advantage actor-critic reinforcement learning algorithm based on a delay strategy (DAAC) is proposed. Built on the traditional policy-gradient actor-critic framework, DAAC uses two critic networks, one computing the state value function and the other the advantage estimate of actions, while maximizing the expected entropy of the target policy; the critic networks are trained with the delayed policy update technique. The algorithm was evaluated in MuJoCo, the physics simulator of OpenAI Gym on the Linux platform, and compared with the traditional reinforcement learning algorithms DQN, TRPO, and DDPG across different robot simulators. The experimental results show that DAAC effectively reduces the volatility of the training process, lets the policy converge to the optimal solution faster, and obtains higher rewards.
Authors: QI Wen-kai; SANG Guo-ming (College of Information Science and Technology, Dalian Maritime University, Dalian 116026, China)
Source: Journal of Chinese Computer Systems (《小型微型计算机系统》), CSCD / Peking University Core Journal, 2020, Issue 8, pp. 1656-1664 (9 pages)
Funding: Supported by the National Natural Science Foundation of China (61672122); the Fundamental Research Funds for the Central Universities "13th Five-Year Plan" Key Research Project (3132016348); and the Fundamental Research Funds for the Central Universities (3132019207).
Keywords: reinforcement learning; policy gradient; delayed update; maximum entropy; actor-critic network
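The two mechanisms the abstract names, an entropy-augmented advantage estimate and delayed policy updates relative to the critics, can be sketched minimally as follows. This is an illustrative sketch only: the function names, the entropy weight `alpha`, and the update interval `delay` are assumptions for exposition, not the paper's notation or implementation.

```python
import math

def entropy(probs):
    """Shannon entropy of a discrete policy distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def soft_advantage(q_value, state_value, probs, alpha=0.2):
    """Advantage estimate (Q - V) augmented with an entropy bonus,
    encouraging the policy to keep exploring (alpha is assumed)."""
    return (q_value - state_value) + alpha * entropy(probs)

def training_schedule(total_steps, delay=2):
    """Delayed policy updates: the critics are updated every step,
    while the actor is refreshed only every `delay` steps, which
    damps the oscillation described in the abstract."""
    critic_updates = actor_updates = 0
    for step in range(1, total_steps + 1):
        critic_updates += 1          # critics update every step
        if step % delay == 0:        # actor lags behind the critics
            actor_updates += 1
    return critic_updates, actor_updates
```

With `delay=2`, ten training steps yield ten critic updates but only five actor updates, so the policy always trains against more settled value estimates.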