Abstract
[Objective] To address the loss of semantic information in text representation models based on the Vector Space Model, this paper proposes a Chinese text representation algorithm based on complex networks. [Methods] Word relatedness is computed from the concepts, link structure, and category system of Wikipedia. On this basis, a text is represented as a weighted complex network in which feature words are the nodes, relatedness relations between words are the edges, and the relatedness values are the edge weights. [Results] Experiments show that the proposed representation improves text similarity computation and text classification performance. [Limitations] The co-occurrence window and the rules for choosing its span are borrowed from existing research. [Conclusions] The proposed representation better preserves the structural information of a text and the associations between words, and the Wikipedia-based word relatedness measure makes the semantic information expressed by the text network more accurate.
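The network construction described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function name `build_text_network`, the `span` parameter, and the toy `relatedness` table are all hypothetical, and the hand-written lookup table merely stands in for the Wikipedia-based relatedness measure, whose formula the abstract does not give.

```python
def build_text_network(tokens, relatedness, span=2):
    """Build a weighted text network as {frozenset({u, v}): weight}.

    tokens      -- feature words of one text, in reading order
    relatedness -- hypothetical lookup table mapping a word pair to a score;
                   a stand-in for the Wikipedia-based measure (assumption)
    span        -- co-occurrence window: how many following words to pair with
    """
    edges = {}
    for i, u in enumerate(tokens):
        # Pair each word with the next `span` words in the text.
        for v in tokens[i + 1 : i + 1 + span]:
            if u == v:
                continue  # no self-loops
            pair = frozenset((u, v))
            weight = relatedness.get(pair, 0.0)
            if weight > 0.0:
                edges[pair] = weight  # edge weight = word relatedness
    return edges


# Toy input; the scores are made up for illustration only.
tokens = ["text", "network", "node", "text", "edge"]
relatedness = {
    frozenset(("text", "network")): 0.6,
    frozenset(("network", "node")): 0.8,
    frozenset(("node", "edge")): 0.7,
}
edges = build_text_network(tokens, relatedness, span=2)
nodes = set().union(*edges) if edges else set()
```

In this sketch a pair of words gets an edge only if the two words co-occur within the window and have a nonzero relatedness score, which is one plausible reading of the abstract rather than a confirmed detail of the paper.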
Source
New Technology of Library and Information Service (《现代图书情报技术》)
CSSCI
Peking University Core Journal (北大核心)
2014, No. 11, pp. 38-44 (7 pages)
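Funding
One of the research outcomes of the National Natural Science Foundation of China project "Research on Chinese Text Semantic Similarity Based on Complex Networks" (Grant No. 71373200).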