Abstract
Chinese word segmentation is a foundation of natural language processing, and cross ambiguity is one of the bottlenecks in improving segmentation accuracy. This paper proposes a method for resolving cross ambiguity that combines forward and backward maximum matching with the passive aggressive (PA) algorithm. First, a segmentation model is trained with the PA algorithm. Second, forward and backward maximum matching are used to locate cross ambiguities. Third, each sentence or sentence fragment that may contain a cross ambiguity is passed to the segmentation model and decoded. Finally, the maximum matching results and the decoded model output are spliced together to form the final segmentation. In the experiments, the model was trained on People's Daily (Renmin Daily) data from February to December 2014 and tested on data from January 2014. The precision, recall, and F-score on cross ambiguity were 98.32%, 98.14%, and 98.23%, respectively, indicating that the method is effective and feasible.
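The detection step can be illustrated with a minimal Python sketch. It assumes that a cross ambiguity is flagged wherever forward and backward maximum matching place word boundaries differently; the abstract does not state the exact criterion, so this is an assumption. The lexicon, the `max_len` value, and all function names are hypothetical, and the sketch stops at detection: in the paper the flagged spans are then decoded by the PA-trained model, which is not shown here.

```python
def forward_max_match(text, lexicon, max_len):
    """Greedy left-to-right longest-match segmentation."""
    words, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            # A single character is always accepted as a fallback.
            if size == 1 or piece in lexicon:
                words.append(piece)
                i += size
                break
    return words


def backward_max_match(text, lexicon, max_len):
    """Greedy right-to-left longest-match segmentation."""
    words, j = [], len(text)
    while j > 0:
        for size in range(min(max_len, j), 0, -1):
            piece = text[j - size:j]
            if size == 1 or piece in lexicon:
                words.append(piece)
                j -= size
                break
    words.reverse()
    return words


def boundaries(words):
    """Character offsets at which a segmentation ends a word."""
    cuts, pos = set(), 0
    for w in words:
        pos += len(w)
        cuts.add(pos)
    return cuts


def ambiguous_spans(text, lexicon, max_len=4):
    """Return both segmentations and the spans where they disagree."""
    fwd = forward_max_match(text, lexicon, max_len)
    bwd = backward_max_match(text, lexicon, max_len)
    fwd_cuts, bwd_cuts = boundaries(fwd), boundaries(bwd)
    spans, start = [], 0
    # Walk the boundaries both segmentations agree on; a stretch between
    # two agreed boundaries is suspicious if the inner cuts differ.
    for end in sorted(fwd_cuts & bwd_cuts):
        inner_f = {c for c in fwd_cuts if start < c < end}
        inner_b = {c for c in bwd_cuts if start < c < end}
        if inner_f != inner_b:
            spans.append((start, end))
        start = end
    return fwd, bwd, spans


if __name__ == "__main__":
    toy_lexicon = {"研究", "研究生", "生命", "起源"}  # hypothetical toy lexicon
    text = "研究生命起源"
    fwd, bwd, spans = ambiguous_spans(text, toy_lexicon)
    print(fwd, bwd, [text[s:e] for s, e in spans])
```

On this toy input the two passes disagree only on the first four characters, so only that fragment would need to be handed to a statistical model, while the agreed remainder can be kept as is.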
Authors
甘蓉
GAN Rong (School of Automotive Engineering, Shaanxi Polytechnic Institute, Xianyang 712000, China)
Source
《西华大学学报(自然科学版)》
CAS
2018, No. 6, pp. 32-36 (5 pages)
Journal of Xihua University: Natural Science Edition
Keywords
Chinese word segmentation
cross ambiguity
maximum matching algorithm
passive aggressive (PA) algorithm