Abstract
The segmentation dictionary is a basic component of an information processing system, and its query efficiency directly affects the performance of that system. Because all information in a computer is stored as binary code, this paper converts string processing into binary-string processing (which supports strings in any language) and builds a segmentation dictionary mechanism based on a Trie index tree. The length of the binary string can be adjusted automatically to suit the requirements of different applications, producing different Trie structures and making it easier to find a suitable balance between storage space and query efficiency. The query speed of this index-based mechanism is independent of the number of words in the dictionary and depends only on the length of the word being looked up; in addition, shared prefix index values save a large amount of memory as the vocabulary grows.
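The mechanism summarized above can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say how words are encoded or how nodes are stored, so the sketch assumes UTF-8 encoding, interprets the adjustable "length of the binary string" as the number of bits consumed per Trie level, and stores children in plain Python dicts. The names `BitTrie`, `chunk_bits`, and `node_count` are hypothetical.

```python
from typing import Dict, Iterator

_END = "$"  # end-of-word marker; a str, so it cannot collide with int chunk keys


class BitTrie:
    """Trie keyed on fixed-width bit chunks of a word's encoded bytes.

    Assumptions (not taken from the paper): UTF-8 encoding, chunk widths
    that divide 8, and dict-based child tables.
    """

    def __init__(self, chunk_bits: int = 4, encoding: str = "utf-8") -> None:
        # Restricting the width to divisors of 8 keeps chunks byte-aligned.
        if chunk_bits not in (1, 2, 4, 8):
            raise ValueError("chunk_bits must be 1, 2, 4 or 8")
        self.chunk_bits = chunk_bits
        self.encoding = encoding
        self.root: Dict = {}
        self.node_count = 1  # counts the root node

    def _chunks(self, word: str) -> Iterator[int]:
        """Yield the word's bytes as chunk_bits-wide integers, high bits first."""
        mask = (1 << self.chunk_bits) - 1
        for byte in word.encode(self.encoding):
            for shift in range(8 - self.chunk_bits, -1, -self.chunk_bits):
                yield (byte >> shift) & mask

    def insert(self, word: str) -> None:
        node = self.root
        for chunk in self._chunks(word):
            if chunk not in node:
                node[chunk] = {}
                self.node_count += 1
            node = node[chunk]
        node[_END] = True  # mark that a complete word ends at this node

    def contains(self, word: str) -> bool:
        # The walk takes one step per chunk of the query word, so the number
        # of steps does not depend on how many words are stored.
        node = self.root
        for chunk in self._chunks(word):
            node = node.get(chunk)
            if node is None:
                return False
        return _END in node


if __name__ == "__main__":
    trie = BitTrie(chunk_bits=4)
    for w in ["中国", "中文", "word"]:
        trie.insert(w)
    print(trie.contains("中文"), trie.contains("中"), trie.node_count)
```

In this sketch a smaller `chunk_bits` gives each node fewer possible children (at most 2^chunk_bits) but makes every path longer, while a larger value does the opposite; that is one way to read the space/speed trade-off the abstract describes. Words that share a prefix, such as "中国" and "中文", reuse the nodes for their common leading chunks.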
Source
Computer and Modernization (《计算机与现代化》)
2013, No. 1, pp. 5-7 (3 pages)
Keywords
Chinese information processing
Chinese word segmentation
dictionary mechanism
Trie index tree