Abstract
Traditional domain classification methods for spoken dialogue systems, such as SVM, require a large amount of manual annotation. To address this, the LDA (Latent Dirichlet Allocation) model is applied to domain classification in spoken dialogue systems. Because spoken dialogue texts are short, contain few words, and suffer from data sparseness, a domain classification method based on word-embedding text extension is proposed on top of the LDA model. The method has two main features: 1) the word embedding method word2vec is used to semantically expand spoken dialogue text produced by speech recognition, which resembles short text, converting it into long text so that the LDA topic model can estimate its latent topics more effectively; 2) the unsupervised probabilistic generative model LDA is used to model and classify the expanded spoken dialogue text, which reduces the cost of manual annotation. Experimental results show that, compared with applying the LDA model directly, word2vec text extension with a suitably chosen expansion length improves the average accuracy, average recall, and average F1 value of domain classification by 26.1%, 25.5%, and 27.2% respectively, and that the method is robust to some degree.
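The text extension step described in the abstract can be illustrated with a minimal sketch. It assumes that each token of an utterance is followed by its nearest neighbours in embedding space, ranked by cosine similarity. The paper's actual expansion rule and expansion length are not given here, so the function names, the `top_k` parameter, and the toy vectors below are all hypothetical. In practice the vectors would come from a trained word2vec model, and the expanded token lists would then be passed to an LDA implementation.

```python
import math

# Toy 2-D embeddings invented for illustration only; a real system would
# load vectors trained with word2vec on a large corpus.
TOY_VECTORS = {
    "flight": (0.9, 0.1),
    "plane": (0.85, 0.15),
    "airport": (0.8, 0.3),
    "song": (0.1, 0.9),
    "music": (0.15, 0.85),
    "singer": (0.3, 0.8),
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def nearest_words(word, vectors, top_k):
    """Return the top_k words most similar to `word`, best match first."""
    if word not in vectors:
        return []  # out-of-vocabulary tokens cannot be expanded
    scored = [
        (cosine(vectors[word], vec), other)
        for other, vec in vectors.items()
        if other != word
    ]
    # Sort by descending similarity; break ties alphabetically.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [other for _, other in scored[:top_k]]


def expand_utterance(tokens, vectors, top_k=2):
    """Lengthen a short utterance by appending each token's neighbours."""
    expanded = []
    for token in tokens:
        expanded.append(token)
        expanded.extend(nearest_words(token, vectors, top_k))
    return expanded


if __name__ == "__main__":
    print(expand_utterance(["book", "flight"], TOY_VECTORS, top_k=2))
```

Here `top_k` stands in for the expansion length: a larger value gives the topic model more co-occurrence evidence per utterance, but it also risks adding words that are only loosely related.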
Source
《新疆大学学报(自然科学版)》
CAS
Peking University Core Journals (北大核心)
2016, No. 2, pp. 209-214, 220 (7 pages in total)
Journal of Xinjiang University (Natural Science Edition)
Funding
National Natural Science Foundation of China (61365005, 60965002)
Keywords
spoken dialogue system
spoken language understanding (SLU)
Latent Dirichlet Allocation (LDA)
topic model
text extension