Abstract
Keyword extraction from a single document has applications in web page retrieval, knowledge comprehension, text classification, and many other areas. This paper proposes GSNA, a keyword extraction method that combines graph structure with node association and can identify the keywords of a single document without an external corpus. First, the frequent closed itemsets of the text are mined and a set of strong association rules is generated. Second, the heads and bodies of the strong association rules are taken as nodes; an edge exists between two nodes if and only if a strong association rule holds between them, and the edge weight is the lift of that rule, so that the rule set is modeled as an association graph. Third, a new random walk algorithm is designed to compute node importance scores, jointly considering each node's graph-structure attributes, semantic information, and associations with other nodes. Finally, to avoid semantic containment among the extracted terms, the nodes are semantically clustered and the center of each cluster is selected as a keyword. Three experiments were conducted on representative Chinese and English data sets: parameter selection for the association graph model, the number of keywords extracted, and comparison with other algorithms. The results show that the method improves keyword extraction performance.
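The graph-construction and scoring steps described in the abstract can be illustrated with a minimal sketch. It is not the GSNA algorithm itself, and several simplifications are assumptions made here: sentences serve as transactions, pairs of single terms stand in for rules mined from frequent closed itemsets, a plain weighted PageRank stands in for the paper's random walk (which also uses semantic information), and the final clustering step is omitted. The function names, thresholds, and sample data are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def build_lift_graph(transactions, min_support=0.3, min_lift=1.0):
    """Return {node: {neighbor: lift}} for term pairs that pass both thresholds."""
    n = len(transactions)
    item_count = defaultdict(int)
    pair_count = defaultdict(int)
    for t in transactions:
        items = sorted(set(t))
        for a in items:
            item_count[a] += 1
        for a, b in combinations(items, 2):
            pair_count[(a, b)] += 1

    graph = defaultdict(dict)
    for (a, b), c in pair_count.items():
        support = c / n
        # lift = P(a, b) / (P(a) * P(b)); values above 1 mean positive association
        lift = support / ((item_count[a] / n) * (item_count[b] / n))
        if support >= min_support and lift > min_lift:
            graph[a][b] = lift
            graph[b][a] = lift
    return dict(graph)


def weighted_pagerank(graph, damping=0.85, iterations=50):
    """Plain weighted PageRank on an undirected graph (a stand-in, not GSNA)."""
    nodes = sorted(graph)
    score = {v: 1.0 / len(nodes) for v in nodes}
    out_weight = {v: sum(graph[v].values()) for v in nodes}
    for _ in range(iterations):
        new = {}
        for v in nodes:
            # Each neighbor u passes on its score in proportion to the edge weight.
            incoming = sum(score[u] * graph[u][v] / out_weight[u] for u in graph[v])
            new[v] = (1 - damping) / len(nodes) + damping * incoming
        score = new
    return score


# Hypothetical toy document: each inner list is one sentence's terms.
sentences = [
    ["graph", "random", "walk"],
    ["graph", "random", "walk", "keyword"],
    ["keyword", "cluster"],
    ["keyword", "cluster", "graph"],
    ["random", "walk"],
]
graph = build_lift_graph(sentences)
scores = weighted_pagerank(graph)
for term, value in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{term}\t{value:.3f}")
```

In this sketch an edge is kept only when the pair is both frequent enough and positively associated, which mirrors the "strong association rule" condition in spirit; the thresholds would need tuning on real text.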
Authors
马慧芳
王双
李苗
李宁
MA Huifang; WANG Shuang; LI Miao; LI Ning (College of Computer Science and Engineering, Northwest Normal University, Lanzhou, Gansu 730070, China; Guangxi Key Laboratory of Trusted Software, Guilin University of Electronic Technology, Guilin, Guangxi 541004, China; Institute of Information Engineering, Chinese Academy of Sciences, Beijing 100093, China)
Source
《中文信息学报》
CSCD
Peking University Core Journal (北大核心)
2019, Issue 9, pp. 69-78 (10 pages)
Journal of Chinese Information Processing
Funding
National Natural Science Foundation of China (61762078, 61802404, 61363058)
Research Project of the Guangxi Key Laboratory of Trusted Software (kx201705)
Keywords
keyword extraction
random walk
node attributes
semantic information
node association