Abstract
Acquiring annotated data has long been a challenge for supervised methods. For the intention recognition task in Chinese spoken language understanding, two weakly supervised training approaches were studied: active learning combined with self-training, and co-training. Within a cascade framework, a method was proposed that uses two feature subsets as the two "views" for co-training: the semantic class features obtained from key semantic concept recognition, and the character features of the sentence itself. Experiments on a Chinese spoken language corpus show that, compared with passive learning and with active learning alone, combining active learning with self-training reduces the amount of manual annotation the most. Co-training, starting from very little initial annotated data and using the two feature subsets, lowered the classification error rate on the character feature subset alone by 0.52% on average.
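The two-view co-training idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not name the classifier, so a tiny naive Bayes model (`TinyNB`) stands in for it, and the semantic class names (`DATE`, `WEATHER`, `TICKET`), the intent labels, the `per_round` and `threshold` parameters, and the toy sentences are all hypothetical. One classifier is trained per view, and in each round the examples that either classifier labels most confidently are moved from the unlabeled pool into the labeled set.

```python
import math
from collections import Counter, defaultdict


class TinyNB:
    """Multinomial naive Bayes with add-one smoothing (stand-in classifier, an assumption)."""

    def fit(self, docs, labels):
        self.classes = sorted(set(labels))
        self.prior = Counter(labels)
        self.counts = defaultdict(Counter)
        for tokens, y in zip(docs, labels):
            self.counts[y].update(tokens)
        self.vocab = {t for c in self.counts.values() for t in c}
        return self

    def predict_proba(self, tokens):
        n = sum(self.prior.values())
        logp = {}
        for y in self.classes:
            total = sum(self.counts[y].values()) + len(self.vocab)
            lp = math.log(self.prior[y] / n)
            for t in tokens:
                if t in self.vocab:  # tokens never seen in training are ignored
                    lp += math.log((self.counts[y][t] + 1) / total)
            logp[y] = lp
        m = max(logp.values())
        z = sum(math.exp(v - m) for v in logp.values())
        return {y: math.exp(v - m) / z for y, v in logp.items()}


def fit_views(labeled):
    """Train one classifier per view: characters (view 0) and semantic classes (view 1)."""
    ys = [y for _, _, y in labeled]
    clf_chars = TinyNB().fit([a for a, _, _ in labeled], ys)
    clf_sem = TinyNB().fit([b for _, b, _ in labeled], ys)
    return clf_chars, clf_sem


def co_train(labeled, unlabeled, rounds=5, per_round=1, threshold=0.6):
    """labeled: (chars, sem_classes, label) triples; unlabeled: (chars, sem_classes) pairs.

    rounds, per_round and threshold are hypothetical tuning knobs.
    """
    labeled, pool = list(labeled), list(unlabeled)
    for _ in range(rounds):
        if not pool:
            break
        picked = {}  # pool index -> pseudo-label
        for view, clf in enumerate(fit_views(labeled)):
            scored = []
            for i, example in enumerate(pool):
                probs = clf.predict_proba(example[view])
                label = max(probs, key=probs.get)
                scored.append((probs[label], i, label))
            scored.sort(reverse=True)
            # Each view contributes its most confident predictions above the threshold.
            for conf, i, label in scored[:per_round]:
                if conf >= threshold:
                    picked.setdefault(i, label)
        if not picked:
            break
        labeled += [(pool[i][0], pool[i][1], y) for i, y in picked.items()]
        pool = [ex for i, ex in enumerate(pool) if i not in picked]
    clf_chars, clf_sem = fit_views(labeled)
    return clf_chars, clf_sem, labeled


# Toy data (hypothetical): character view plus semantic-class view for each sentence.
seed = [
    (list("明天天气怎么样"), ["DATE", "WEATHER"], "weather"),
    (list("订一张机票"), ["TICKET"], "flight"),
]
pool = [
    (list("今天天气好吗"), ["DATE", "WEATHER"]),
    (list("我要订机票"), ["TICKET"]),
]
clf_chars, clf_sem, grown = co_train(seed, pool)
probs = clf_chars.predict_proba(list("天气"))
print(len(grown), max(probs, key=probs.get))
```

The character-view classifier is the one evaluated at the end, mirroring the abstract's report of the error rate on the character feature subset alone; the semantic-class view only helps to select and label additional training examples.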
Source
《计算机应用》
CSCD
Peking University Core Journal
2015, Issue 7, pp. 1965-1968, 1974 (5 pages in total)
Journal of Computer Applications
Funding
National Natural Science Foundation of China (10925419, 90920302, 61072124, 11074275, 11161140319, 91120001, 61271426)
Strategic Priority Research Program of the Chinese Academy of Sciences (XDA06030100, XDA06030500)
National High Technology Research and Development Program of China (863 Program) (2012AA012503)
Key Deployment Project of the Chinese Academy of Sciences (KGZD-EW-103-2)
"Ten-Hundred-Thousand" Talent Training Project of Inner Mongolia Normal University
General Program of the Natural Science Foundation of Inner Mongolia (2012MS0930, 2013MS0912)
Scientific Research Project of Higher Education Institutions of Inner Mongolia Autonomous Region (NJZY12032, NJZY028)
Research Start-up Fund for Introduced High-Level Talents of Inner Mongolia Normal University (2014YJRC036)
Keywords
intention recognition
spoken language understanding
weakly supervised training
co-training
active learning