Abstract
[Objective] This paper aims to improve Chinese word segmentation on domain literature by adjusting the output of existing segmentation methods. [Methods] We first analyzed the shortcomings of traditional Chinese word segmentation methods on domain literature. Based on that analysis, we designed Term Frequency Deviation (TFD), a segmentation criterion that reflects the word formation characteristics of domain literature, and then developed an unsupervised approach that uses TFD to refine segmentation results. [Results] In experiments on an agricultural corpus, the proposed approach improved the F1 score of three popular Chinese word segmentation tools (ICTCLAS, THULAC and LTP) by 2%-3%. It is also simple to implement and robust to parameter settings. [Limitations] The approach does little to improve recall. [Conclusions] The TFD-based refining approach effectively improves the accuracy of existing segmentation results on domain literature. Because it requires neither a domain-specific lexicon nor manually annotated corpora, it can be readily applied to other domains.
Authors
倪维健
孙浩浩
刘彤
曾庆田
Ni Weijian, Sun Haohao, Liu Tong, Zeng Qingtian (College of Computer Science and Technology, Shandong University of Science and Technology, Qingdao 266510, China)
Source
《数据分析与知识发现》
CSSCI
CSCD
PKU Core Journals
2018, Issue 2, pp. 96-104 (9 pages)
Data Analysis and Knowledge Discovery
Funding
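This work is one of the research outcomes of the National Natural Science Foundation of China projects "Research on Structured Recommendation Techniques for User Groups and Their Applications" (Grant No. 61602278),
"Automatic Modeling of Emergency Plan Process Graphs and Its Application in Scenario-Based Diagnosis" (Grant No. 71704096), and
"Research on Multi-Granularity Knowledge Fusion Methods in the Agricultural Big Data Environment" (Grant No. 31671588).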
Keywords
Domain Literature
Chinese Word Segmentation
Segmentation Refining
Term Frequency Deviation