Abstract
[Objective] This paper aims to improve word segmentation accuracy for domain-specific texts that contain a high proportion of technical terms and nouns. [Methods] We propose and implement the DBLC model, which combines dictionary-based, statistical, and deep learning methods. We collected a subset of cases from the Chinese Management Case Center as a domain-specific corpus and compared the DBLC model experimentally against several existing word segmentation models. [Results] On the experimental corpus, the DBLC model outperformed the other models on every evaluation metric and reached a word segmentation accuracy of 96.3%. [Limitations] The model does not treat original dictionary words and new words differently, the storage structure of the dictionary was not considered, and the computational time complexity of the model is relatively high. [Conclusions] The proposed DBLC model improves word segmentation accuracy for domain-specific texts, and its accuracy is positively correlated with dictionary size.
Authors
冯国明
张晓冬
刘素辉
Feng Guoming; Zhang Xiaodong; Liu Suhui (School of Economics and Management, University of Science and Technology Beijing, Beijing 100083, China)
Source
《数据分析与知识发现》
CSSCI
CSCD
Peking University Core Journals
2018, Issue 5, pp. 40-47 (8 pages)
Data Analysis and Knowledge Discovery
Keywords
Chinese Word Segmentation
Sequence Labeling
BI-LSTM-CRF
Autonomous Learning
Word Segmentation Based on Dictionary