Option-Critic Algorithm Based on Sub-Goal Quantity Optimization (基于优化子目标数的Option-Critic算法)

Cited by: 3


Abstract: Time abstraction is an important research direction in hierarchical reinforcement learning, and sub-goals are the core element from which time abstractions are formed. At present, most hierarchical reinforcement learning methods require sub-goals, or the number of sub-goals, to be specified manually. In many cases this not only demands substantial human intervention, but the chosen settings may also be ill-suited to the scenario at hand; the problem is especially pronounced when the environment dynamics are unknown. To address this, the Option-Critic algorithm based on Sub-goal Quantity Optimization (OC-SQO) is proposed. It adds an exploration phase in which the agent, through simple interaction with the environment, obtains an initial estimate of the number of sub-goals suited to the application scenario and identifies sub-goals on that basis. It then uses policy gradients to generate the corresponding abstractions, each represented as a triple of initial state, intra-option policy, and termination function, and trains on them; the abstractions obtained through interaction change the current state, and the process is optimized iteratively. OC-SQO can start executing from any state and does not require sub-goals or parameters to be specified in advance. During execution it uses policy gradients to generate the intra-option policies, the policy over options, and the termination functions; it needs neither an intrinsic reward signal nor knowledge of the sub-goals, which keeps human intervention to a minimum. Experiments verify the effectiveness of the algorithm.

Reinforcement learning has been studied extensively as a branch of machine learning, in which an agent keeps interacting with the environment with the goal of maximizing long-term return, making it prominent in areas such as control and optimal scheduling. Deep reinforcement learning (DRL) is designed to handle large-scale, high-dimensional data such as video and images by extracting abstract representations and learning an optimal policy through a reinforcement learning component. Deep reinforcement learning has become a research hotspot in artificial intelligence, and many algorithms have been developed. For example, the deep Q-network (DQN), one of the best-known models in deep reinforcement learning, is based on a convolutional neural network (CNN) and the Q-learning algorithm and has been used to learn policies in complex environments with high-dimensional inputs. However, DQN fails to perform well in sparse-reward environments or with large-scale state spaces. Hierarchical reinforcement learning was introduced to solve these problems: the initial problem space is decomposed into several sub-problem spaces, and the original large problem is solved by dealing with each sub-problem individually. However, hierarchical reinforcement learning tends to be effective in tasks with discrete state/action spaces. Hierarchical deep reinforcement learning, which combines hierarchical reinforcement learning with deep learning, follows a similar idea but solves the sub-problems through neural networks. Time abstraction is an important concept in hierarchical reinforcement learning, and the sub-goal is the key to producing time abstraction. Time abstraction, as one of the most promising areas of hierarchical reinforcement learning, requires the notion of sub-goal as a prerequisite. At present, however, sub-goals or the number of sub-goals must be manually specified, which limits automation and generalization across different scenarios. To solve the problem, we pr
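The abstract describes each abstraction (option) as a triple of initial state, intra-option policy, and termination function. As a rough illustration only, the sketch below shows one way such a triple could be represented and executed in a call-and-return fashion. It is not the OC-SQO implementation: the corridor environment, the `Option` class, the `run_option` helper, and the hand-picked sub-goal state are all assumptions made for this example, whereas the paper learns the policies and termination functions with policy gradients and does not require sub-goals to be given.

```python
import random
from dataclasses import dataclass
from typing import Callable, FrozenSet, List


@dataclass(frozen=True)
class Option:
    """An option as an (initiation set, intra-option policy, termination function) triple."""
    initiation: FrozenSet[int]           # states in which the option may be started
    policy: Callable[[int], int]         # intra-option policy: state -> action
    termination: Callable[[int], float]  # beta(s): probability of terminating in state s


def step(state: int, action: int, n_states: int) -> int:
    """Hypothetical 1-D corridor: action is -1 or +1, clipped to [0, n_states - 1]."""
    return max(0, min(n_states - 1, state + action))


def run_option(option: Option, state: int, n_states: int,
               rng: random.Random, max_steps: int = 100) -> List[int]:
    """Follow the option's policy until its termination function fires."""
    if state not in option.initiation:
        raise ValueError("option cannot be initiated in this state")
    trajectory = [state]
    for _ in range(max_steps):
        state = step(state, option.policy(state), n_states)
        trajectory.append(state)
        # Terminate with probability beta(state).
        if rng.random() < option.termination(state):
            break
    return trajectory


N_STATES = 10
SUBGOAL = 6  # assumption: a sub-goal fixed by hand purely for illustration

go_to_subgoal = Option(
    initiation=frozenset(range(N_STATES)),                # may start in any state
    policy=lambda s: 1 if s < SUBGOAL else -1,            # move toward the sub-goal
    termination=lambda s: 1.0 if s == SUBGOAL else 0.0,   # stop exactly at the sub-goal
)

trajectory = run_option(go_to_subgoal, state=2, n_states=N_STATES, rng=random.Random(0))
print(trajectory)
```

Here the policy and termination function are fixed lambdas so that the behaviour is easy to follow; in an Option-Critic style method they would instead be parameterized functions whose parameters are updated by gradient steps.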

Authors: LIU Cheng-Hao (刘成浩); ZHU Fei (朱斐); LIU Quan (刘全) (School of Computer Science and Technology, Soochow University, Suzhou, Jiangsu 215006; Provincial Key Laboratory for Computer Information Processing Technology (Soochow University), Suzhou, Jiangsu 215006)

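Affiliations: School of Computer Science and Technology, Soochow University; Jiangsu Provincial Key Laboratory for Computer Information Processing Technology, Soochow University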

Source: Chinese Journal of Computers (《计算机学报》), indexed in EI, CAS, CSCD, and the Peking University Core Journals list; 2021, No. 9, pp. 1922-1933 (12 pages)

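Funding: Supported by the National Natural Science Foundation of China (61303108, 61772355); the Major Project of the Natural Science Research Program of Jiangsu Higher Education Institutions (17KJA520004); the Suzhou Key Industry Technology Innovation Program, Prospective Applied Research Project (SYG201804); and the Priority Academic Program Development of Jiangsu Higher Education Institutions (PAPD).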

Keywords: hierarchical deep reinforcement learning; time abstraction; sub-goal; reinforcement learning; Option

Classification: TP18 [Automation and Computer Technology: Control Theory and Control Engineering]


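Citing literature (3)

1. 乌兰, 刘全, 黄志刚, 朱斐, 张立华. Maximum entropy hierarchical reinforcement learning with advantage-weighted mutual information maximization [J]. Chinese Journal of Computers (计算机学报), 2023, 46(10): 2066-2083.
2. 栗军伟, 刘全, 徐亚鹏. Option-Critic algorithm based on mutual information optimization [J]. Computer Science (计算机科学), 2024, 51(2): 252-258.
3. 张斯力, 李梓健, 蔡瑞初, 郝志峰, 闫玉光. Reinforcement recommendation system based on causal mechanism constraints [J]. Computer Engineering (计算机工程), 2024, 50(5): 279-290.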

