Abstract
Deep reinforcement learning has achieved great success in many applications, but it typically requires a large number of interactions with the environment to learn a policy, which makes training inefficient. Imitation learning addresses this challenge by learning from expert demonstrations, yet it in turn requires a large set of demonstration trajectories, which can be costly to collect in complex tasks. This paper proposes an active-learning-based behavioral cloning algorithm that reduces the demonstration cost by actively selecting the starting states of demonstrations. Using two strategies, uncertainty sampling and dissimilarity sampling, the method picks the most informative state from a candidate set as the starting state and then queries the expert for a fixed-length demonstration trajectory, aiming to learn an effective policy from as few demonstrations as possible. Experiments on several tasks show that the proposed method performs behavioral cloning with fewer demonstration trajectories, thereby lowering the expert demonstration cost in reinforcement learning.
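The abstract describes the selection loop only at a high level. The sketch below illustrates one plausible reading of it in Python: score each candidate starting state by the current policy's uncertainty and by its dissimilarity to already-demonstrated states, pick the highest-scoring state, and query the expert for a fixed-length trajectory. The names policy_action_probs, query_expert, and train_bc, as well as the weighted combination of the two scores, are assumptions for illustration, not the paper's actual implementation.

```python
# Minimal sketch of active starting-state selection for behavior cloning.
# All function names and the scoring rule are hypothetical; the paper may
# combine uncertainty sampling and dissimilarity sampling differently.
import numpy as np

def entropy_uncertainty(action_probs):
    """Policy uncertainty at a state: entropy of its action distribution."""
    p = np.clip(action_probs, 1e-12, 1.0)
    return -np.sum(p * np.log(p))

def dissimilarity(state, demonstrated_states):
    """Distance from a candidate state to the nearest already-demonstrated state."""
    if len(demonstrated_states) == 0:
        return 0.0  # no demonstrations yet: fall back to pure uncertainty
    return min(np.linalg.norm(state - s) for s in demonstrated_states)

def select_start_state(candidates, policy_action_probs, demonstrated_states, alpha=0.5):
    """Pick the candidate state with the highest combined score (assumed weighted sum)."""
    scores = []
    for s in candidates:
        u = entropy_uncertainty(policy_action_probs(s))
        d = dissimilarity(s, demonstrated_states)
        scores.append(alpha * u + (1.0 - alpha) * d)
    return candidates[int(np.argmax(scores))]

# One round of the assumed active behavior-cloning loop, where query_expert
# and train_bc are provided by the environment and the learner respectively:
#   start = select_start_state(candidate_states, policy_action_probs, demo_states)
#   trajectory = query_expert(start, horizon=T)      # fixed-length expert rollout
#   demo_states.extend(s for s, a in trajectory)
#   policy = train_bc(policy, trajectory)            # supervised behavior-cloning update
```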
Authors
HUANG Wenyu; HUANG Shengjun (College of Computer Science and Technology / College of Artificial Intelligence, Nanjing University of Aeronautics & Astronautics, Nanjing 211106, China)
Source
Journal of Nanjing University of Aeronautics & Astronautics
Indexed in: CAS, CSCD, PKU Core Journals (北大核心)
2021, No. 5, pp. 766-771 (6 pages)
Funding
Supported by the Aviation Power Fund (航空动力基金) under Grant No. 6141B09050342.
Keywords
reinforcement learning
imitation learning
behavioral cloning
inverse reinforcement learning
active learning