Abstract
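At present, research on Chinese coordinate structures depends heavily on annotated corpora, cannot exploit the semantic information in unannotated corpora, and has not introduced semi-supervised learning methods. Taking conditional random fields as the basic framework, this paper proposes a coordinate structure recognition method based on semi-supervised learning. Word vectors are trained on an unannotated corpus and unsupervised features are then extracted from them; linguistic features are also introduced for comparative experiments to examine how different features affect recognition. Experiments show that adding unsupervised features improves recognition, bringing the F-score to 85.75%, while combining linguistic and unsupervised features gives an F-score of 85.77%. This indicates that linguistic features have very little effect on the results, whereas unsupervised features reduce the manual effort of feature selection and bring semantic information into the recognition model in a relatively concise way.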
Research on Chinese coordinate structures currently relies heavily on annotated data, cannot exploit the semantic information in unannotated data, and has not yet adopted semi-supervised learning. This paper proposes a semi-supervised method for recognizing coordinate structures within the framework of conditional random fields (CRF). Word embeddings are trained on unlabeled data and unsupervised features are extracted from them. Linguistic features are then introduced in comparative experiments to examine how different features affect coordinate structure recognition. Experimental results show that the unsupervised features improve recognition, with the F-score reaching 85.71%; combining linguistic and unsupervised features yields an F-score of 85.72%. The unsupervised features reduce the workload of manual feature selection and incorporate semantic information into the recognition model in a more concise way.
Authors
杨丹
邵玉斌
张海玲
龙华
杜庆治
YANG Dan; SHAO Yu-bin; ZHANG Hai-ling; LONG Hua; DU Qing-zhi (School of Information Engineering and Automation, Kunming University of Science and Technology, Kunming 650500, China)
Source
《小型微型计算机系统》
CSCD
PKU Core Journal (北大核心)
2021, Issue 9, pp. 1818-1825 (8 pages)
Journal of Chinese Computer Systems
Funding
Supported by the National Natural Science Foundation of China (61761025).
Keywords
coordinate structures
semi-supervised learning
unsupervised features
conditional random field