Abstract
This paper presents an approach to Chinese word sense disambiguation based on the latent maximum entropy (LME) principle. It is motivated by a limitation of Jaynes' maximum entropy principle, which can use only the explicit statistical features of the context when building a language model. After studying the relationship between words and sememes in HowNet, we convert the word collocations obtained from the contexts in the training corpus into sememe collocations, which gives a method for extracting latent semantic features of text based on sememe collocation information. These latent features are combined with traditional context features, and the latent maximum entropy principle is then applied to disambiguate polysemous words in text. Experimental results show that, on the sense disambiguation of ten polysemous verbs, the proposed method improves accuracy by about 4%.
Source
Journal of Chinese Information Processing (《中文信息学报》)
CSCD
Peking University Core Journal
2012, Issue 3, pp. 72-78 (7 pages)
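Funding
National Natural Science Foundation of China (60873013, 61070119)
Open Project Fund of the Key Laboratory of Computational Linguistics (Peking University), Ministry of Education (KLCL-1005)
Funding Project for Academic Human Resources Development in Institutions of Higher Learning under the Jurisdiction of Beijing Municipality (PHR201007131)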
Keywords
latent maximum entropy principle
text latent features
sememe collocation information
word sense disambiguation