Abstract
In recent years, generating policies that generalize has become one of the active research topics in deep reinforcement learning, and many related results have appeared. One representative work is the generalized value iteration network (GVIN). GVIN is a differentiable planning network that operates on irregular graphs. It uses a special graph convolution operator to approximately represent the state-transition matrix, so that, after learning the structural information of an irregular graph, it can plan through the value iteration (VI) process and thereby produce policies that generalize across tasks with irregular graph structure. However, because GVIN does not allocate planning time according to the importance of states, each round of VI performs value updates synchronously at all states over the entire state space. When the state space is large, such synchronous updates can degrade the planning performance of the network. This work applies the idea of asynchronous updates to GVIN. By defining a priority for each state and performing asynchronous value updates during VI, it proposes a new asynchronous planning network, the generalized asynchronous value iteration network (GAVIN). In previously unseen tasks with irregular graph structure, GAVIN plans more efficiently and more effectively than GVIN. Furthermore, this work improves the reinforcement learning algorithm and the graph convolution operator used in GVIN, and verifies the effectiveness of these improvements through path-planning experiments on irregular graphs and real maps.
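The contrast between synchronous and prioritized asynchronous value updates can be illustrated with a minimal tabular sketch. This is not the GAVIN model: the abstract does not give the paper's priority definition, graph convolution operator, or learned transition model, so the sketch assumes a known transition tensor `P`, a reward table `R`, and uses the absolute Bellman error as the state priority (in the spirit of prioritized sweeping). All function and parameter names are hypothetical.

```python
import heapq
import numpy as np


def sync_value_iteration(P, R, gamma=0.9, sweeps=50):
    """Synchronous VI: each sweep backs up every state at once.

    P: (A, S, S) transition probabilities, R: (S, A) rewards.
    """
    V = np.zeros(P.shape[1])
    for _ in range(sweeps):
        Q = R + gamma * np.einsum("ast,t->sa", P, V)  # Q[s, a]
        V = Q.max(axis=1)
    return V


def backup(P, R, V, s, gamma):
    """One Bellman optimality backup for a single state s."""
    return max(R[s, a] + gamma * (P[a, s] @ V) for a in range(P.shape[0]))


def async_value_iteration(P, R, gamma=0.9, budget=200, tol=1e-8):
    """Asynchronous VI: back up one state at a time, highest priority first.

    ASSUMPTION: priority = |Bellman error|; the paper's definition may differ.
    `budget` caps the number of single-state updates (the planning time).
    """
    S = P.shape[1]
    V = np.zeros(S)
    # preds[t] = states that can reach t under some action
    preds = [np.nonzero(P[:, :, t].sum(axis=0))[0] for t in range(S)]
    # Max-heap via negated priorities; entries may go stale and are rechecked.
    heap = [(-abs(backup(P, R, V, s, gamma) - V[s]), s) for s in range(S)]
    heapq.heapify(heap)
    updates = 0
    while heap and updates < budget:
        _, s = heapq.heappop(heap)
        new_v = backup(P, R, V, s, gamma)
        if abs(new_v - V[s]) <= tol:
            continue  # stale entry: this state is already consistent
        V[s] = new_v
        updates += 1
        # Only predecessors of s can have a changed Bellman error.
        for p in preds[s]:
            err = abs(backup(P, R, V, p, gamma) - V[p])
            if err > tol:
                heapq.heappush(heap, (-float(err), int(p)))
    return V, updates
```

With a large enough `budget`, both routines approach the same value function; with a small budget, the asynchronous variant spends its updates on the states whose values are currently most inconsistent instead of sweeping the whole state space.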
Authors
CHEN Zi-Xuan (陈子璇)
ZHANG Zong-Zhang (章宗长)
PAN Zhi-Yuan (潘致远)
ZHANG Lin-Jing (张琳婧)
State Key Laboratory for Novel Software Technology, Nanjing University, Nanjing 210023, China; School of Computer Science and Technology, Soochow University, Suzhou 215006, China
Source
Journal of Software (《软件学报》)
EI
CSCD
Peking University Core Journals
2021, Issue 11, pp. 3496-3511 (16 pages)
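Funding
National Natural Science Foundation of China (61876119)
Natural Science Foundation of Jiangsu Province (BK20181432)
Fundamental Research Funds for the Central Universities (022114380010)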
Keywords
deep learning
reinforcement learning
imitation learning
planning
asynchronous update