Abstract
Proximal policy optimization (PPO) estimates one distribution by sampling in the vicinity of a known distribution: the new policy learns in the neighborhood of the old policy, which serves as the new policy's approximate distribution. [Objective] To address the unsatisfactory learning efficiency and convergence of the PPO algorithm in reinforcement learning, an improved PPO algorithm is proposed. [Method] First, a new loss function is proposed to update the network parameters of the PPO algorithm, with generalized advantage estimation (GAE) used to describe the advantage function; second, a multithreaded training scheme similar to that of the asynchronous advantage actor-critic (A3C) algorithm is used to train the agent; finally, a new parameter update method is designed to update the parameters of both the primary and secondary networks. [Result] Simulation results show that the method enables the agent to complete learning and training faster, with better convergence during training; owing to multithreading, its training speed is at least 5 times that of the conventional PPO algorithm. [Conclusion] The improved PPO algorithm performs better, providing a new direction for subsequent research on reinforcement learning algorithms.
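For reference, the two standard building blocks the abstract names, the PPO clipped surrogate loss and GAE, can be sketched in plain Python as follows. This is an illustrative sketch of the textbook formulas only; the paper's modified loss function and its primary/secondary network update scheme are not specified in the abstract, so they are not reproduced here.

```python
import math

def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """Generalized advantage estimation (GAE-lambda).

    `values` must hold len(rewards) + 1 entries: the state values
    V(s_0)..V(s_T), with the last entry used as the bootstrap value.
    Advantages are accumulated backward: A_t = delta_t + gamma*lam*A_{t+1},
    where delta_t = r_t + gamma*V(s_{t+1}) - V(s_t).
    """
    advantages = [0.0] * len(rewards)
    gae = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        gae = delta + gamma * lam * gae
        advantages[t] = gae
    return advantages

def ppo_clip_loss(old_log_probs, new_log_probs, advantages, eps=0.2):
    """Standard PPO clipped surrogate objective, returned as a loss
    (negated mean) so that minimizing it maximizes the objective.

    ratio_t = pi_new(a_t|s_t) / pi_old(a_t|s_t), and the per-step term
    is min(ratio_t * A_t, clip(ratio_t, 1-eps, 1+eps) * A_t).
    """
    total = 0.0
    for lp_old, lp_new, adv in zip(old_log_probs, new_log_probs, advantages):
        ratio = math.exp(lp_new - lp_old)
        unclipped = ratio * adv
        clipped = max(min(ratio, 1.0 + eps), 1.0 - eps) * adv
        total += min(unclipped, clipped)
    return -total / len(advantages)
```

With identical old and new policies the ratio is 1 and the loss reduces to the negated mean advantage; once the ratio leaves the interval [1 - eps, 1 + eps], the clipping caps the incentive to move further, which is what keeps the new policy near the old one.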
Authors
FEI Zhengshun; WANG Yanping; GONG Haibo; XIANG Xinjian; GUO Junhao (School of Automation and Electrical Engineering, Zhejiang University of Science and Technology, Hangzhou 310023, Zhejiang, China)
Source
Journal of Zhejiang University of Science and Technology
CAS
2023, No. 1, pp. 23-29 (7 pages)
Funding
Zhejiang Provincial Key Research and Development Program (2018C01085)
Zhejiang Provincial Natural Science Foundation (LQ15F030006)
Zhejiang Provincial Department of Education Research Project (Y202249418)
Graduate Research and Innovation Fund of Zhejiang University of Science and Technology (2021yjskc04)
Keywords
reinforcement learning
proximal policy optimization
generalized advantage estimation
multithreading