On an improved algorithm of proximal policy optimization

Abstract: Proximal policy optimization (PPO) estimates a target distribution by sampling in the neighborhood of a known distribution: the new policy is optimized by learning in the vicinity of the old policy, which serves as an approximate distribution of the new one. [Objective] To address the unsatisfactory learning efficiency and convergence of the PPO algorithm in reinforcement learning, an improved PPO algorithm is proposed. [Method] First, a new loss function is proposed to update the network parameters of the PPO algorithm, and generalized advantage estimation (GAE) is adopted to describe the advantage function; second, a multithreading strategy similar to that of the asynchronous advantage actor-critic (A3C) algorithm is used to train the agent; finally, a new parameter update scheme is designed to update the parameters of both the primary and secondary networks. [Result] Simulation results show that the proposed method enables the agent to complete learning and training faster, with better convergence during training; owing to multithreading, the training speed is at least 5 times faster than that of the conventional PPO algorithm. [Conclusion] The improved PPO algorithm achieves better performance, which provides a new idea for subsequent research on reinforcement learning algorithms.
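
The abstract builds on two standard ingredients of PPO: the clipped surrogate objective, which keeps the new policy close to the old one, and generalized advantage estimation (GAE). As a point of reference only, the following is a minimal Python sketch of these two standard components; it is not the authors' modified loss function, multithreaded training scheme, or primary/secondary network update, none of which are detailed in this listing, and all function and parameter names (compute_gae, ppo_clip_loss, clip_eps, and so on) are illustrative.

import numpy as np
import torch

def compute_gae(rewards, values, dones, gamma=0.99, lam=0.95):
    # Generalized advantage estimation over one trajectory.
    # rewards, dones: length-T sequences; values: length-(T+1) sequence
    # that includes the bootstrap value of the final state.
    values = np.asarray(values, dtype=np.float32)
    T = len(rewards)
    advantages = np.zeros(T, dtype=np.float32)
    gae = 0.0
    for t in reversed(range(T)):
        # TD residual: delta_t = r_t + gamma * V(s_{t+1}) * (1 - done_t) - V(s_t)
        delta = rewards[t] + gamma * values[t + 1] * (1.0 - dones[t]) - values[t]
        # GAE recursion: exponentially weighted sum of TD residuals
        gae = delta + gamma * lam * (1.0 - dones[t]) * gae
        advantages[t] = gae
    returns = advantages + values[:-1]  # targets for the value network
    return advantages, returns

def ppo_clip_loss(new_log_probs, old_log_probs, advantages, clip_eps=0.2):
    # Clipped surrogate objective: the probability ratio between the new and
    # old policies is clipped to [1 - eps, 1 + eps], so the new policy is
    # only trusted in a neighborhood of the old one.
    ratio = torch.exp(new_log_probs - old_log_probs)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()

# Illustrative use within one policy update:
# adv, ret = compute_gae(rewards, values, dones)
# loss = ppo_clip_loss(new_logp, old_logp, torch.as_tensor(adv))
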
Authors: FEI Zhengshun, WANG Yanping, GONG Haibo, XIANG Xinjian, GUO Junhao (School of Automation and Electrical Engineering, Zhejiang University of Science and Technology, Hangzhou 310023, Zhejiang, China)
Source: Journal of Zhejiang University of Science and Technology (CAS), 2023, No. 1, pp. 23-29 (7 pages)
Funding: Key Research and Development Program of Zhejiang Province (2018C01085); Natural Science Foundation of Zhejiang Province (LQ15F030006); Research Project of the Department of Education of Zhejiang Province (Y202249418); Postgraduate Research and Innovation Fund of Zhejiang University of Science and Technology (2021yjskc04).
Keywords: reinforcement learning; proximal policy optimization; generalized advantage estimation; multithreading