Abstract
As a pivotal component of task-oriented dialogue systems, the dialogue policy can be trained with the discriminative deep Dyna-Q framework. However, that framework uses the vanilla deep Q-network method in its direct reinforcement learning phase and adopts multilayer perceptrons (MLPs) as the basic structure of its world model, which limits the efficiency, performance and stability of dialogue policy learning. This paper proposes an improved discriminative deep Dyna-Q method for task-oriented dialogue policy learning. In the improved direct reinforcement learning phase, a NoisyNet is employed to improve the agent's exploration, and the dual-stream architecture of the Dueling Network, the Double Q-network and n-step bootstrapping are combined to optimize the computation of Q-values. For the world model, a soft-attention-based model is designed to replace the MLP structure. Experimental results show that the proposed method outperforms the best existing results on three metrics: task success rate, average dialogue turns and average reward. Ablation and robustness analyses further validate the effectiveness of the method.
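The abstract names several value-based reinforcement-learning refinements. As an illustrative aside (a minimal NumPy sketch, not the authors' implementation), the following shows the core computations these techniques refer to: a NoisyNet-style perturbed linear layer, the dueling aggregation of the value and advantage streams, and an n-step Double-Q bootstrap target. All function and variable names here are hypothetical.

```python
import numpy as np

def noisy_linear(x, w_mu, w_sigma, b_mu, b_sigma, rng):
    # NoisyNet-style layer: learned mean weights perturbed by
    # learned sigma scales times freshly sampled Gaussian noise,
    # so exploration comes from the network itself.
    eps_w = rng.standard_normal(w_mu.shape)
    eps_b = rng.standard_normal(b_mu.shape)
    return x @ (w_mu + w_sigma * eps_w) + (b_mu + b_sigma * eps_b)

def dueling_q(value, advantage):
    # Dueling aggregation: Q(s,a) = V(s) + A(s,a) - mean_a' A(s,a'),
    # combining the two streams into action values.
    return value + advantage - advantage.mean(axis=-1, keepdims=True)

def n_step_double_q_target(rewards, q_online_next, q_target_next, gamma=0.99):
    # rewards: the n rewards r_t ... r_{t+n-1} along the sampled trajectory.
    # q_online_next / q_target_next: Q-values at state s_{t+n} from the
    # online and target networks, respectively.
    n = len(rewards)
    g = sum((gamma ** k) * r for k, r in enumerate(rewards))  # discounted n-step return
    a_star = int(np.argmax(q_online_next))      # Double-Q: select action with online net
    return g + (gamma ** n) * q_target_next[a_star]  # evaluate it with target net
```

The combination mirrors the abstract's description: the dueling output feeds both Q arrays, and the n-step Double-Q target replaces the one-step max-Q target of the vanilla DQN.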
Authors
Dai Bin, Zeng Bi, Wei Peng-fei, Huang Yong-jian (School of Computer Science and Technology, Guangdong University of Technology, Guangzhou 510006, China; Guangzhou Xuanyuan Research Institute Co., Ltd., Guangzhou 510000, China)
Source
Journal of Guangdong University of Technology (《广东工业大学学报》)
CAS
2023, Issue 4, pp. 9-17, 23 (10 pages in total)
Funding
Key Program of the Joint Funds of the National Natural Science Foundation of China (U21A20478)
Natural Science Foundation of Guangdong Province (2019A1515011056)
Shunde District Core Technology Research Project (2130218003002)
Keywords
task-oriented dialogue system
dialogue policy learning
reinforcement learning
user simulator