Abstract
Modern Chinese word segmentation has made considerable progress, but segmentation of ancient texts remains constrained by the lexical, semantic, and grammatical characteristics of ancient Chinese, and no effective method has yet emerged. This paper applies a new word discovery method based on mutual information and adjacency entropy to find out-of-vocabulary words in the Book of Han (《汉书》), and combines them with an ancient Chinese vocabulary list, a list of ancient personal names, and a list of ancient place names to build a segmentation dictionary for ancient texts. On this basis, pyNLPIR is used to segment the Book of Han. The experimental results show that new word discovery improves, to some extent, the coverage of the user dictionary required for segmenting ancient texts, but that it performs poorly on words of more than three characters. The experiments indicate that combining new word discovery with dictionary information effectively improves the accuracy of word segmentation for ancient Chinese.
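As an illustration of the new word discovery step named in the abstract, the following is a minimal sketch of scoring candidate strings by mutual information (internal cohesion) and adjacency entropy (freedom of the left and right neighbours). It is not the author's implementation: the function name `discover_words`, the thresholds, the maximum candidate length, and the toy sample are all assumptions made for illustration, and the input is assumed to have had its punctuation removed beforehand.

```python
import math
from collections import Counter, defaultdict


def entropy(neighbours):
    """Shannon entropy (base 2) of a Counter of neighbouring characters."""
    total = sum(neighbours.values())
    return -sum((c / total) * math.log2(c / total) for c in neighbours.values())


def discover_words(text, max_len=3, min_count=2, min_pmi=1.0, min_entropy=0.5):
    """Return {candidate: (pmi, adjacency_entropy)} for strings passing all thresholds.

    All parameter names and default values are hypothetical, chosen for this sketch.
    """
    n = len(text)
    counts = Counter()
    left = defaultdict(Counter)   # characters seen immediately before each candidate
    right = defaultdict(Counter)  # characters seen immediately after each candidate

    for size in range(1, max_len + 1):
        for i in range(n - size + 1):
            gram = text[i:i + size]
            counts[gram] += 1
            if size > 1:
                if i > 0:
                    left[gram][text[i - 1]] += 1
                if i + size < n:
                    right[gram][text[i + size]] += 1

    def prob(gram):
        # Relative frequency among all substrings of the same length.
        return counts[gram] / (n - len(gram) + 1)

    results = {}
    for gram, count in counts.items():
        if len(gram) < 2 or count < min_count:
            continue
        # Mutual information: take the weakest of all two-part splits.
        pmi = min(
            math.log2(prob(gram) / (prob(gram[:k]) * prob(gram[k:])))
            for k in range(1, len(gram))
        )
        # Adjacency entropy: take the less varied of the two sides.
        adjacency = min(entropy(left[gram]), entropy(right[gram]))
        if pmi >= min_pmi and adjacency >= min_entropy:
            results[gram] = (pmi, adjacency)
    return results


# Toy sample (made up for this sketch, not taken from the Book of Han).
sample = "帝至长安而王居长安也民入长安乃还"
for word, (pmi, adjacency) in discover_words(sample).items():
    print(word, round(pmi, 3), round(adjacency, 3))
```

Candidates that pass both thresholds would then be reviewed and merged with the existing word, personal-name, and place-name lists to form the user dictionary supplied to the segmenter. Taking the minimum over splits and over the two sides is one common conservative choice; other formulations use averages or weighted combinations.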
Author
LI Xiao-yu (李筱瑜), College of Economics and Management, Beijing Information Science & Technology University, Beijing 100192, China
Source
Software Guide (《软件导刊》)
2019, Issue 4, pp. 60-63 (4 pages)
Funding
National Key Research and Development Program of China (2017YFB1400400)
Keywords
ancient texts
word segmentation
mutual information
adjacency entropy
new word discovery