Abstract
This paper presents a part-of-speech (POS) disambiguation approach based on idiosyncratic rules, applied to a parallel corpus that is not aligned at the word level. The approach targets words that occur very frequently in the corpus but are still poorly disambiguated by statistical methods; for these words, highly specific disambiguation rules were written by hand. A parallel disambiguation algorithm built on these rules was designed for five typical POS-ambiguous words: three Chinese words, "guoqu" (过去), "jihua" (计划) and "yu" (与), and two English words, "back" and "so". Experiments on a large-scale parallel corpus yielded an average F-score of 98.45% for these words. The results indicate that the rules are not constrained by context length or by the number of templates, are particularly well suited to bilingual parallel processing, and disambiguate effectively.
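As an illustration of what a hand-written, word-specific rule over a sentence-aligned (but not word-aligned) pair might look like, here is a minimal Python sketch for "jihua" (计划). The function name `tag_jihua`, the tag labels, and the English cue patterns are all hypothetical assumptions made for this example; the abstract does not give the paper's actual rules.

```python
import re

# Hypothetical cue patterns (assumptions, not the paper's rules).
# Verb reading: the English side contains "plan(s)/planned/planning to".
VERB_CUES = re.compile(r"\b(?:plan|plans|planned|planning)\s+to\b", re.IGNORECASE)
# Noun reading: a determiner or possessive, an optional modifier, then "plan(s)".
NOUN_CUES = re.compile(
    r"\b(?:a|an|the|this|that|his|her|its|their|our|my|your)\s+(?:\w+\s+)?plans?\b",
    re.IGNORECASE,
)


def tag_jihua(zh_sentence, en_sentence):
    """Guess the POS of 计划 from the whole aligned English sentence.

    No word alignment is used: the rule only scans the English side
    for cues. Returns "v", "n", "unknown", or None when 计划 is absent.
    """
    if "计划" not in zh_sentence:
        return None
    if VERB_CUES.search(en_sentence):
        return "v"
    if NOUN_CUES.search(en_sentence):
        return "n"
    return "unknown"


print(tag_jihua("他们计划明年访问中国。", "They plan to visit China next year."))
print(tag_jihua("这个计划很好。", "This plan is very good."))
```

Because the rule inspects the entire aligned sentence rather than a fixed window around the target word, it is not tied to a particular context length, which is the kind of property the abstract attributes to such rules. A realistic rule set would need many more cues and an explicit ordering among them.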
Source
Journal of Shandong University (Engineering Science) (《山东大学学报(工学版)》)
CAS
PKU Core Journal (北大核心)
2011, No. 6, pp. 18-23, 30 (7 pages)
Funding
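National Natural Science Foundation of China (60773173, 61073119)
Natural Science Foundation of Jiangsu Province (BK2010547)
Social Science Foundation of Jiangsu Province (10YYB007)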
Keywords
parallel corpus
part-of-speech disambiguation
POS-ambiguous words
automatic recognition
Chinese information processing