Abstract
With the rapid development of cognitive computing, the automatic construction of general-purpose knowledge graphs has made great progress. In vertical domains, however, progress has been slow because semantic information such as ontologies is lacking. Thesauri are widely available across specialized domains and contain rich semantic information; if this information can be extracted and used appropriately, it should help automate the construction of domain knowledge graphs. This paper proposes two hypotheses under which entity types and relation types can be extracted from the internal structure of a thesaurus, and on that basis designs an algorithm that automatically generates the initial seed set for a domain knowledge graph from a thesaurus. Finally, using thesauri from the geology and forestry domains as experimental subjects, the automatically generated initial seed sets are used as input to a Bootstrapping algorithm for extraction. Analysis of the extracted results shows that seed sets derived from a thesaurus achieve results close to those of manually designed seeds. In addition, the proposed model is general and offers a new approach to applying thesauri in domain knowledge graph construction.
Authors
韩其琛
赵亚伟
姚郑
付立军
HAN Qichen; ZHAO Yawei; YAO Zheng; FU Lijun (Big Data Analysis Technology Laboratory, University of Chinese Academy of Sciences, Beijing 100049, China)
Source
《中文信息学报》
CSCD
Peking University Core Journal
2018, Issue 8, pp. 1-8 (8 pages)
Journal of Chinese Information Processing
Funding
National Natural Science Foundation of China (61072091)
Chinese Academy of Sciences Informatization Special Construction Project (XXH12502)