Abstract
Reinforcement learning (RL) has achieved great success in Go, video games, navigation, recommender systems, and other fields. However, many RL algorithms still cannot be transferred directly to real physical environments. In simulation, an agent can learn an optimal policy by interacting with the environment through repeated trial and error, whereas many real-world applications must, for safety reasons, restrict the agent's random exploration. Safety has therefore become a major challenge in moving reinforcement learning from simulation to reality. In recent years, much research has been devoted to developing safe reinforcement learning (SRL) algorithms that satisfy safety constraints while maintaining system performance. This paper presents a comprehensive survey of existing SRL algorithms, which are divided into three categories: modification of the learning process, modification of the learning objective, and offline reinforcement learning. Five benchmark platforms are then introduced: Safety Gym, safe-control-gym, SafeRL-Kit, D4RL, and NeoRL. Finally, applications of SRL in autonomous driving, robot control, industrial process control, power system optimization, and healthcare are summarized, followed by conclusions and an outlook.
Authors
WANG Xue-Song (王雪松)
WANG Rong-Rong (王荣荣)
CHENG Yu-Hu (程玉虎)
School of Information and Control Engineering, China University of Mining and Technology, Xuzhou 221116
Source
Acta Automatica Sinica (《自动化学报》)
EI
CAS
CSCD
PKU Core
2023, No. 9, pp. 1813-1835 (23 pages)
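Funding
Supported by the National Natural Science Foundation of China (62176259, 61976215)
and the Key Research and Development Program of Jiangsu Province (BE2022095).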
Keywords
Safe reinforcement learning (SRL)
constrained Markov decision process (CMDP)
learning process
learning objective
offline reinforcement learning