Abstract
Existing Chinese knowledge graphs are typically extracted from crowd-sourced knowledge bases such as Wikipedia and Baidu Baike, but they mainly exploit the entity infobox and category-system information of these encyclopedias. The encyclopedias also contain a large number of internal links, which carry a great deal of knowledge. This paper therefore uses the internal links of Wikipedia to construct edges, counts how often the target entity is mentioned in the definition text of the source entity's entry, and uses the corresponding TF-IDF value as the edge weight, yielding a probabilistic Chinese knowledge graph. A reliable link screening algorithm is also proposed to remove occasional links and make the knowledge graph more reliable. Based on these methods, the paper constructs a probabilistic, association-based, reliable Chinese knowledge graph named "Wenmai" (文脉), which is publicly available on GitHub in the hope of supporting knowledge-guided natural language processing and other downstream tasks.
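The edge-weighting idea described in the abstract can be illustrated with a minimal sketch. It assumes a standard TF-IDF form (relative mention frequency times the log of the inverse document frequency); the paper's exact formula, its normalization, and its reliable link screening algorithm are not specified in the abstract and are not reproduced here. The function name `tfidf_edge_weights` and the toy entries are hypothetical.

```python
import math
from collections import Counter


def tfidf_edge_weights(articles):
    """Weight each (source, target) link by a TF-IDF score.

    `articles` maps a source entity to the list of target entities
    mentioned (with repeats) in that source's entry text.
    Assumption: tf = relative mention frequency within the source entry,
    idf = log(N / df), where df counts the entries mentioning the target.
    """
    n_docs = len(articles)

    # Document frequency: in how many source entries each target appears.
    df = Counter()
    for mentions in articles.values():
        df.update(set(mentions))

    weights = {}
    for source, mentions in articles.items():
        counts = Counter(mentions)
        total = sum(counts.values())
        for target, count in counts.items():
            tf = count / total
            idf = math.log(n_docs / df[target])
            weights[(source, target)] = tf * idf
    return weights


# Hypothetical toy data: entity -> linked entities found in its entry text.
toy_articles = {
    "Beijing": ["China", "China", "Forbidden City"],
    "Shanghai": ["China", "Huangpu River"],
    "Forbidden City": ["Beijing", "Beijing", "China"],
}

weights = tfidf_edge_weights(toy_articles)
for edge, weight in sorted(weights.items()):
    print(edge, round(weight, 3))
```

In this toy data, a target that every entry links to (here "China") receives an IDF of zero, so its edges carry no weight, while targets mentioned repeatedly by only one entry are weighted most heavily.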
Authors
李文浩
刘文长
孙茂松
矣晓沅
LI Wenhao; LIU Wenchang; SUN Maosong; YI Xiaoyuan (The Department of Computer Science and Technology, Tsinghua University, Beijing 100084, China; Institute for Artificial Intelligence, Tsinghua University, Beijing 100084, China; Beijing National Center for Information Science and Technology, Beijing 100084, China; The Department of Computer Science, University of California, Davis, Davis, CA 95616, USA; Jiangsu Collaborative Innovation Center for Language Ability, Jiangsu Normal University, Xuzhou, Jiangsu 221009, China; Microsoft Research Asia, Beijing 100080, China)
Source
《中文信息学报》
CSCD
Peking University Core Journal (北大核心)
2022, No. 12, pp. 67-73 (7 pages)
Journal of Chinese Information Processing
Funding
National Social Science Fund of China (18ZDA238)
Keywords
Wikipedia
knowledge graph construction
reliable link screening