Abstract
To address cross-lingual word similarity between Chinese and Vietnamese, this paper proposes a Wikipedia-based method that uses multilingual Wikipedia concept pages as a bridge. The method exploits the translation correspondence between concepts, the occurrence of words in different concept pages, and the co-occurrence of words with other concepts. It first extracts the set of Chinese and Vietnamese concepts that correspond to each other in Wikipedia and builds a bilingual concept feature space. It then constructs a concept vector for each word from two kinds of features: the frequency of the word in the description text of each concept, and the co-occurrence of the word and the concept in the texts of other concepts. Finally, the similarity of two words is computed as the cosine of the angle between their concept vectors. Experimental results show that the proposed method performs well on Chinese-Vietnamese word similarity, and that the concept co-occurrence relationship improves the accuracy of word similarity.
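The pipeline described in the abstract (shared concept space, term-frequency and co-occurrence features, cosine similarity) can be illustrated with a minimal Python sketch. The abstract does not give the exact weighting formula, so everything specific here is an assumption made for illustration: the function names `concept_vector` and `cosine`, the mixing weight `alpha`, the rule that a word co-occurs with a concept when both the word and the concept title appear in the text of another concept page, and the tiny pre-segmented toy pages. It is not the authors' implementation.

```python
import math
from collections import Counter


def concept_vector(word, pages, titles, alpha=0.5):
    """Build a word's vector over the aligned concept space.

    pages[i]  : token list of the description text of concept i (one language)
    titles[i] : title token of concept i in the same language
    alpha     : hypothetical weight of the co-occurrence feature (an assumption)
    """
    vec = []
    for i, title in enumerate(titles):
        # Term-frequency feature: occurrences of the word in concept i's own text.
        tf = Counter(pages[i])[word]
        # Co-occurrence feature (assumed definition): number of OTHER concept
        # pages whose text mentions both the word and the title of concept i.
        cooc = sum(
            1
            for j, text in enumerate(pages)
            if j != i and word in text and title in text
        )
        vec.append(tf + alpha * cooc)
    return vec


def cosine(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Toy aligned concepts (index i is the same concept in both languages).
zh_titles = ["计算机", "软件", "河流"]
vi_titles = ["máy_tính", "phần_mềm", "sông"]
zh_pages = [
    ["计算机", "程序", "软件", "程序"],
    ["软件", "程序", "计算机"],
    ["河流", "水"],
]
vi_pages = [
    ["máy_tính", "chương_trình", "phần_mềm"],
    ["phần_mềm", "chương_trình", "máy_tính", "chương_trình"],
    ["sông", "nước"],
]

zh_vec = concept_vector("程序", zh_pages, zh_titles)
vi_related = concept_vector("chương_trình", vi_pages, vi_titles)
vi_unrelated = concept_vector("nước", vi_pages, vi_titles)

print(cosine(zh_vec, vi_related))    # similarity of "program" in both languages
print(cosine(zh_vec, vi_unrelated))  # similarity of "program" and "water"
```

Because both vectors are indexed by the same aligned concepts, a Chinese word and a Vietnamese word can be compared directly without translating either one. In practice the concept alignment would come from Wikipedia's interlanguage links and the texts would need word segmentation, neither of which is shown here.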
Source
《南京理工大学学报》
EI
CAS
CSCD
PKU Core (北大核心)
2016, No. 4, pp. 461-466 (6 pages)
Journal of Nanjing University of Science and Technology
Funding
National Natural Science Foundation of China (61175068, 61472168)
Key Natural Science Project of Yunnan Province (2013FA030)
Keywords
Chinese
Vietnamese
word similarity
Wikipedia
concept
co-occurrence relationship
correspondence relationship
word frequency