Abstract
This work was carried out within a research project on modern Kazakh phrase recognition and the construction of a phrase chunk library. Its goal is to study and resolve the types of ambiguity found in NP and VP structures, and it proposes statistical methods for doing so. Building on Kazakh phrases that had already been counted and analyzed, the study gives a comprehensive analysis and classification of the combinational structural ambiguities of Kazakh NP and VP phrases. Resolving NP and VP ambiguity with a mutual information method yields an accuracy of only 72%. Higher accuracy would require a larger training corpus, which the current experimental environment does not provide. The corpus was therefore annotated with a rule-based method and refined manually to improve the training data, and a maximum entropy method was then used to resolve the ambiguities. Experimental results show that the statistical approach to NP and VP structural ambiguity is effective, reaching an accuracy of 80.1% in a closed test.
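The abstract does not say how the mutual information baseline was applied, so the following is only a minimal sketch of one common reading: for an ambiguous three-token sequence, compare the pointwise mutual information of the left pair and the right pair and attach the middle token to the stronger side. The function names, the tie-breaking rule, and the placeholder tokens in `TOY_CORPUS` are all assumptions made for illustration; they are not taken from the paper and are not Kazakh data.

```python
import math
from collections import Counter

# Hypothetical toy corpus of placeholder tokens (not real Kazakh data).
TOY_CORPUS = [
    ["a", "b", "c"],
    ["a", "b"],
    ["a", "b", "d"],
    ["b", "c"],
    ["e", "c"],
]

def build_counts(sentences):
    """Count single tokens and adjacent token pairs."""
    words, pairs = Counter(), Counter()
    for sentence in sentences:
        words.update(sentence)
        pairs.update(zip(sentence, sentence[1:]))
    return words, pairs

def pmi(words, pairs, x, y):
    """Pointwise mutual information log2(P(x,y) / (P(x) P(y))).

    Returns -inf when the pair was never observed.
    """
    joint = pairs[(x, y)]
    if joint == 0:
        return float("-inf")
    p_xy = joint / sum(pairs.values())
    p_x = words[x] / sum(words.values())
    p_y = words[y] / sum(words.values())
    return math.log2(p_xy / (p_x * p_y))

def choose_bracketing(words, pairs, a, b, c):
    """Return "left" for [[a b] c] or "right" for [a [b c]].

    Assumed rule: attach b to whichever neighbour it is more strongly
    associated with; ties go to the left.
    """
    left = pmi(words, pairs, a, b)
    right = pmi(words, pairs, b, c)
    return "left" if left >= right else "right"

if __name__ == "__main__":
    words, pairs = build_counts(TOY_CORPUS)
    print(choose_bracketing(words, pairs, "a", "b", "c"))
```

With sparse counts, many pairs are unseen and their scores collapse to negative infinity, which is consistent with the abstract's remark that better accuracy would need a larger training corpus.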
Source
Journal of Southwest China Normal University (Natural Science Edition)
CAS
CSCD
Peking University Core Journal
2014, Issue 7, pp. 41-46 (6 pages)
Funding
Open Project of the Xinjiang Uygur Autonomous Region Multilingual Information Technology Laboratory (XJDX0905-2013-03)
Keywords
Kazakh
maximum entropy model
NP and VP collocation
attachment disambiguation