
An Experience-Guided Deep Deterministic Actor-Critic Algorithm with Multi-Actor

Cited by: 6
Abstract: Continuous control has long been an important research direction in reinforcement learning. In recent years, the development of deep learning and the advent of the deterministic policy gradients (DPG) algorithm have provided many good ideas for solving continuous control problems. Most of these methods explore by injecting external noise into the action space, but they do not perform well on some continuous control tasks. To address the exploration problem, this paper proposes an experience-guided deep deterministic actor-critic algorithm with multi-actor (EGDDAC-MA). The algorithm needs no external exploration noise; instead, it learns a guiding network from its own excellent experiences, which guides action selection and the updates of the actor and critic networks. In addition, to reduce fluctuations in network learning, the algorithm uses a multi-actor actor-critic (AC) model in which the actor networks are independent of one another, each handling a different phase of an episode. Experiments show that, compared with the DDPG, TRPO and PPO algorithms, EGDDAC-MA performs better on most continuous tasks on the GYM simulation platform.
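To make the mechanism in the abstract concrete, the sketch below outlines, in PyTorch, one way the two ideas could fit together: separate actors that each own one phase of an episode, and a guiding network fit to the agent's own high-return experience in place of external exploration noise. This is a minimal illustration, not the authors' implementation; the network sizes, the fixed phase-splitting rule, the elite-return threshold, and the guiding-loss weight are all illustrative assumptions, and the critic update is omitted for brevity.

# Illustrative sketch of the EGDDAC-MA ideas described in the abstract (not the paper's code).
import random

import torch
import torch.nn as nn
import torch.nn.functional as F


class Actor(nn.Module):
    def __init__(self, state_dim, action_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, action_dim), nn.Tanh())  # actions bounded in [-1, 1]

    def forward(self, state):
        return self.net(state)


class Critic(nn.Module):
    def __init__(self, state_dim, action_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1))

    def forward(self, state, action):
        return self.net(torch.cat([state, action], dim=-1))


class EGDDACMASketch:
    """Per-phase actors plus a guiding actor trained on elite experience."""

    def __init__(self, state_dim, action_dim, n_phases=3, max_steps=300):
        self.actors = [Actor(state_dim, action_dim) for _ in range(n_phases)]
        self.guide = Actor(state_dim, action_dim)   # fit to elite experience only
        self.critic = Critic(state_dim, action_dim)
        self.actor_opts = [torch.optim.Adam(a.parameters(), lr=1e-4)
                           for a in self.actors]
        self.guide_opt = torch.optim.Adam(self.guide.parameters(), lr=1e-4)
        self.phase_len = max_steps // n_phases
        self.elite = []  # (state, action) pairs from the best-return episodes

    def act(self, state, step):
        # Deterministic action from the actor that owns this phase of the episode;
        # no external exploration noise is added.
        phase = min(step // self.phase_len, len(self.actors) - 1)
        with torch.no_grad():
            return self.actors[phase](state)

    def store_elite_episode(self, transitions, episode_return, threshold):
        # Keep the transitions of episodes whose return beats a running threshold.
        if episode_return >= threshold:
            self.elite.extend(transitions)

    def update_guide(self, batch_size=64):
        # Behaviour-clone the guiding network from elite (state, action) pairs.
        if len(self.elite) < batch_size:
            return
        batch = random.sample(self.elite, batch_size)
        states = torch.stack([s for s, _ in batch])
        actions = torch.stack([a for _, a in batch])
        loss = F.mse_loss(self.guide(states), actions)
        self.guide_opt.zero_grad()
        loss.backward()
        self.guide_opt.step()

    def update_actor(self, phase, states, guide_weight=0.1):
        # Deterministic policy gradient plus a pull toward the guiding network.
        actions = self.actors[phase](states)
        dpg_loss = -self.critic(states, actions).mean()
        guide_loss = F.mse_loss(actions, self.guide(states).detach())
        loss = dpg_loss + guide_weight * guide_loss
        self.actor_opts[phase].zero_grad()
        loss.backward()
        self.actor_opts[phase].step()

A training loop would call act(state, step) at every step, collect the (state, action) pairs of the finished episode into store_elite_episode, and periodically call update_guide and update_actor; how the paper actually splits phases, selects elite episodes, and weights the guidance term is detailed in the full text, not in this sketch.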
Authors: Chen Hongming, Liu Quan, Yan Yan, He Bin, Jiang Yubin, Zhang Linlin (School of Computer Science and Technology, Soochow University, Suzhou, Jiangsu 215006; Provincial Key Laboratory for Computer Information Processing Technology (Soochow University), Suzhou, Jiangsu 215006; Key Laboratory of Symbolic Computation and Knowledge Engineering (Jilin University), Ministry of Education, Changchun 130012; Collaborative Innovation Center of Novel Software Technology and Industrialization, Nanjing 210000)
Source: Journal of Computer Research and Development (计算机研究与发展), EI, CSCD, Peking University core journal, 2019, No. 8, pp. 1708-1720 (13 pages)
Funding: National Natural Science Foundation of China (61772355, 61702055, 61472262, 61502323, 61502329); Major Program of the Natural Science Research of Jiangsu Higher Education Institutions (18KJA520011, 17KJA520004); Suzhou Applied Basic Research Program, Industrial Part (SYG201422)
Keywords: reinforcement learning; deep reinforcement learning; deterministic actor-critic; experience guiding; expert guiding; multi-actor

相关作者

内容加载中请稍等...

相关机构

内容加载中请稍等...

相关主题

内容加载中请稍等...

浏览历史

内容加载中请稍等...
;
使用帮助 返回顶部