Abstract
Building on the maximum matching (MM) method, this paper proposes a double-backtracking method for Chinese word segmentation. The method first preprocesses the text to be segmented, splitting it into short, fine-grained pieces. It then uses forward matching, backtracking matching, tail-word matching, and fragment checking to detect ambiguous fields. Crossing-ambiguity fields are segmented by giving priority to long words while also taking two-word clusters into account, and the more difficult crossing-ambiguity fields with multiple chain lengths are likewise detected and segmented. Experiments on a large body of randomly sampled corpus text show that the method is effective.
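The abstract does not spell out the backtracking, tail-word matching, or fragment-checking rules, so the following is only a minimal Python sketch of the parts it does name: punctuation-based preprocessing, baseline forward maximum matching, and detection of crossing ambiguity by its usual definition (a matched word overlaps another lexicon word that starts inside it and ends beyond it). The function names, the `max_len` parameter, and the toy lexicon are assumptions made for illustration and are not taken from the paper.

```python
import re

def split_clauses(text):
    """Preprocessing (assumed form): cut the text at punctuation into short clauses."""
    return [c for c in re.split(r"[,。;:!?、,.;:!?\s]+", text) if c]

def forward_mm(clause, lexicon, max_len=4):
    """Baseline forward maximum matching: take the longest lexicon word at each position."""
    words, i = [], 0
    while i < len(clause):
        for n in range(min(max_len, len(clause) - i), 0, -1):
            cand = clause[i:i + n]
            # A single character is always accepted so the scan cannot stall.
            if n == 1 or cand in lexicon:
                words.append(cand)
                i += n
                break
    return words

def crossing_ambiguities(clause, lexicon, max_len=4):
    """Return (start, end) spans where a forward-MM word overlaps another
    lexicon word that begins inside it and ends past its right boundary."""
    spans, i = [], 0
    for w in forward_mm(clause, lexicon, max_len):
        end = i + len(w)
        for k in range(i + 1, end):  # candidate start inside the matched word
            for e in range(end + 1, min(k + max_len, len(clause)) + 1):
                if clause[k:e] in lexicon:
                    spans.append((i, e))
        i = end
    return spans

# Hypothetical toy lexicon, for illustration only.
lexicon = {"研究", "研究生", "生命", "起源"}
clauses = split_clauses("研究生命起源,中文分词。")
segmented = forward_mm(clauses[0], lexicon)
ambiguous = crossing_ambiguities(clauses[0], lexicon)
```

On the clause 研究生命起源, forward matching greedily takes 研究生 and strands 命 as a single-character fragment, while the detector flags the span covering 研究生命 because 生命 overlaps 研究生. This is the kind of field a backtracking step would then need to re-segment.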
Source
《计算机应用研究》
CSCD
PKU Core Journal (北大核心)
2009, No. 9, pp. 3321-3323 (3 pages)
Application Research of Computers
Funding
Shanghai Leading Academic Discipline Project (T0502)
Keywords
Chinese word segmentation
backtracking matching
crossing ambiguity
multi-chain-length
fragment checking