Abstract
This paper proposes an improved Trie structure in which each node records a character and its position within the word being formed, and child nodes are located by hash lookup. On this basis, the forward maximum matching (FMM) algorithm for Chinese word segmentation is optimised. During segmentation, an automaton mechanism determines whether the longest word has been formed, which removes the need for the forward maximum matching algorithm to adjust the character string according to word length. The time complexity of the algorithm is 1.33, and comparative experiments show a faster segmentation speed. The forward maximum matching algorithm based on the improved Trie structure increases the speed of Chinese word segmentation and is particularly suitable where the lexicon structure must be updated in real time.
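The idea summarised in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the names `TrieNode`, `Trie`, `longest_match`, and `segment` are hypothetical, a plain `dict` stands in for the hash lookup of child nodes, and a single `is_word` flag is assumed as a simplification of the position information stored in each node. The scan walks the Trie like an automaton and remembers the last position at which a complete word ended, so the candidate string never has to be shortened and re-matched.

```python
class TrieNode:
    """One Trie node; children are found by hash lookup (a dict here)."""
    __slots__ = ("children", "is_word")

    def __init__(self):
        self.children = {}    # character -> TrieNode
        self.is_word = False  # assumed simplification of the node's position info


class Trie:
    def __init__(self, words=()):
        self.root = TrieNode()
        for word in words:
            self.insert(word)

    def insert(self, word):
        """Add a word; the lexicon can be updated at any time."""
        if not word:
            return
        node = self.root
        for ch in word:
            nxt = node.children.get(ch)
            if nxt is None:
                nxt = TrieNode()
                node.children[ch] = nxt
            node = nxt
        node.is_word = True

    def longest_match(self, text, start):
        """Return the end index of the longest lexicon word starting at `start`.

        Returns `start` itself when no lexicon word begins there.
        """
        node = self.root
        end = start
        for i in range(start, len(text)):
            node = node.children.get(text[i])
            if node is None:      # no transition: the automaton stops
                break
            if node.is_word:      # remember the longest word seen so far
                end = i + 1
        return end


def segment(text, trie):
    """Forward maximum matching over `text` using the Trie."""
    tokens = []
    pos = 0
    while pos < len(text):
        end = trie.longest_match(text, pos)
        if end == pos:            # unknown character: emit it on its own
            end = pos + 1
        tokens.append(text[pos:end])
        pos = end
    return tokens


lexicon = Trie(["研究", "研究生", "生命", "起源"])
print(segment("研究生命起源", lexicon))  # ['研究生', '命', '起源']
```

Each character of the input is examined at most once per starting position, and a new word passed to `insert` takes effect on the next call to `segment` without rebuilding the structure, which is the property the abstract highlights for lexicons that change in real time.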
Source
Computer Applications and Software (《计算机应用与软件》)
CSCD
Peking University Core Journal (北大核心)
2014, Issue 5, pp. 276-278 (3 pages)
Funding
Hainan Provincial Department of Education Fund Project (Hjkj201137)
Sanya City Academy-Locality Cooperation Project (2011YD19)
Keywords
Chinese information processing
Word segmentation
Forward maximum matching algorithm