On an improved algorithm of proximal policy optimization

Abstract: Proximal policy optimization (PPO) estimates a target distribution by sampling in the neighborhood of a known distribution: the new policy is optimized by learning in the vicinity of the old policy, which serves as an approximate distribution of the new one. [Objective] To address the unsatisfactory learning efficiency and convergence of the PPO algorithm in reinforcement learning, an improved PPO algorithm is proposed. [Method] First, a new loss function is proposed to update the network parameters of the PPO algorithm, and generalized advantage estimation (GAE) is adopted to describe the advantage function; second, a multithreading strategy similar to that of the asynchronous advantage actor-critic (A3C) algorithm is used to train the agent; finally, a new parameter update scheme is designed to update the parameters of both the primary and secondary networks. [Result] Simulation results show that the proposed method enables the agent to complete learning and training faster, with better convergence during training; owing to multithreading, the training speed is at least 5 times faster than that of the conventional PPO algorithm. [Conclusion] The improved PPO algorithm achieves better performance, which provides a new idea for subsequent research on reinforcement learning algorithms.
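
The abstract builds on two standard ingredients of PPO: the clipped surrogate objective, which keeps the new policy close to the old one, and generalized advantage estimation (GAE). As a point of reference only, the following is a minimal Python sketch of these two standard components; it is not the authors' modified loss function, multithreaded training scheme, or primary/secondary network update, none of which are detailed in this listing, and all function and parameter names (compute_gae, ppo_clip_loss, clip_eps, and so on) are illustrative.

import numpy as np
import torch

def compute_gae(rewards, values, dones, gamma=0.99, lam=0.95):
    # Generalized advantage estimation over one trajectory.
    # rewards, dones: length-T sequences; values: length-(T+1) sequence
    # that includes the bootstrap value of the final state.
    values = np.asarray(values, dtype=np.float32)
    T = len(rewards)
    advantages = np.zeros(T, dtype=np.float32)
    gae = 0.0
    for t in reversed(range(T)):
        # TD residual: delta_t = r_t + gamma * V(s_{t+1}) * (1 - done_t) - V(s_t)
        delta = rewards[t] + gamma * values[t + 1] * (1.0 - dones[t]) - values[t]
        # GAE recursion: exponentially weighted sum of TD residuals
        gae = delta + gamma * lam * (1.0 - dones[t]) * gae
        advantages[t] = gae
    returns = advantages + values[:-1]  # targets for the value network
    return advantages, returns

def ppo_clip_loss(new_log_probs, old_log_probs, advantages, clip_eps=0.2):
    # Clipped surrogate objective: the probability ratio between the new and
    # old policies is clipped to [1 - eps, 1 + eps], so the new policy is
    # only trusted in a neighborhood of the old one.
    ratio = torch.exp(new_log_probs - old_log_probs)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()

# Illustrative use within one policy update:
# adv, ret = compute_gae(rewards, values, dones)
# loss = ppo_clip_loss(new_logp, old_logp, torch.as_tensor(adv))
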
Authors: FEI Zhengshun, WANG Yanping, GONG Haibo, XIANG Xinjian, GUO Junhao (School of Automation and Electrical Engineering, Zhejiang University of Science and Technology, Hangzhou 310023, Zhejiang, China)
Source: Journal of Zhejiang University of Science and Technology (CAS), 2023, No. 1, pp. 23-29 (7 pages)
Funding: Key Research and Development Program of Zhejiang Province (2018C01085); Natural Science Foundation of Zhejiang Province (LQ15F030006); Research Project of the Department of Education of Zhejiang Province (Y202249418); Postgraduate Research and Innovation Fund of Zhejiang University of Science and Technology (2021yjskc04).
Keywords: reinforcement learning; proximal policy optimization; generalized advantage estimation; multithreading