Abstract
Combinational ambiguity is a difficult problem in automatic Chinese word segmentation, and the difficulty lies in two tasks: detecting combinational ambiguity strings and resolving the ambiguity. This paper studies how the part of speech of a combinational ambiguity string changes between its segmented and unsegmented readings, and proposes a new method for collecting such strings automatically. Experiments show that the method is effective: in the tagged People's Daily corpus of January 1998 (about 1 million words), it detects more than 400 combinational ambiguity strings, far more than conventional methods detect. A maximum entropy model is then used to disambiguate 60 combinational ambiguity strings, and the effect of six kinds of features, alone and in combination, on disambiguation performance is examined. The average disambiguation accuracy reaches 88.05%.
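To make the detection task concrete, the following is a minimal sketch of a simple baseline for collecting candidate strings from a segmented corpus: a string is a candidate if it occurs both as a single token and as two adjacent tokens. This is not the part-of-speech-based method proposed in the paper, whose details the abstract does not give; the function name and the toy corpus are assumptions made for illustration only.

```python
from collections import defaultdict


def find_combinational_candidates(sentences):
    """Return {string: (count_as_one_token, count_as_two_tokens)}.

    `sentences` is a list of already segmented sentences (lists of tokens).
    Hypothetical helper for illustration, not the paper's algorithm.
    """
    whole = defaultdict(int)
    for sent in sentences:
        for token in sent:
            whole[token] += 1

    split = defaultdict(int)
    for sent in sentences:
        # Join each pair of adjacent tokens and check whether the joined
        # string also occurs somewhere in the corpus as a single token.
        for left, right in zip(sent, sent[1:]):
            joined = left + right
            if joined in whole:
                split[joined] += 1

    return {s: (whole[s], split[s]) for s in split}


# Toy corpus (invented): "才能" and "将来" each appear once as a single
# word and once split across two adjacent tokens.
corpus = [
    ["他", "很", "有", "才能"],
    ["只有", "努力", "才", "能", "成功"],
    ["他", "将来", "会", "成功"],
    ["他", "将", "来", "北京"],
]
candidates = find_combinational_candidates(corpus)
print(candidates)
```

A baseline like this only finds strings whose two readings both occur in the corpus, which is one reason a method that exploits part-of-speech regularities might be expected to find more candidates.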
Source
《中文信息学报》
CSCD
PKU Core (Peking University core journal list)
2007, No. 1, pp. 3-8 (6 pages)
Journal of Chinese Information Processing
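Funding
Supported by the "Chinese and Minority-Language Corpus Tool Software" project of the Department of Language Information Management, Ministry of Education (MZ115-022)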
Keywords
computer application
Chinese information processing
Chinese word segmentation
combinational ambiguity
maximum entropy
feature selection