Abstract
[Purpose/Significance] In data integration, entity recognition often performs poorly because the semantic information in the text is not fully extracted, and traditional blocking methods are not efficient enough. This paper proposes an efficient entity recognition method that takes semantic information into account, in order to improve both the effectiveness and the efficiency of entity recognition. [Method/Process] Taking two datasets A and B that need to be integrated as an example, the method first preprocesses all records in A and B (word segmentation, stop-word removal, and similar operations) and then builds an inverted index of dataset A over every word in A. Second, for each record in dataset B, it computes the importance within dataset A of each word in that record and selects a keyword to represent the record according to that importance. Finally, the keyword is matched against the index terms, and the Sentence-BERT model is used to compute, one by one, the similarity between the record the keyword represents and every record listed under the matching index term. Records whose similarity exceeds a threshold are judged to describe the same entity. The procedure repeats until every record in dataset B has been compared. [Result/Conclusion] The experimental results show that the proposed method outperforms traditional entity recognition methods on precision, recall, stability, and response time, and that it offers methodological guidance for the entity recognition problem in data integration.
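The pipeline described in the abstract (inverted index over A, keyword selection for each record in B, then thresholded similarity against the indexed candidates) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation, and it rests on several assumptions. The abstract does not define "importance", so an IDF-style score over dataset A is assumed here. A token-level Jaccard similarity stands in for Sentence-BERT so that the sketch runs with the standard library alone, and the `similarity` parameter marks where an embedding-based cosine similarity would be plugged in. The stop-word list, the whitespace tokenizer, the threshold of 0.5, the sample records, and all function names are hypothetical.

```python
import math
from collections import defaultdict

# Hypothetical stop-word list; the paper's actual list is not given.
STOP_WORDS = {"the", "a", "an", "of", "and", "in"}


def tokenize(text):
    """Lowercase, split on whitespace, and drop stop words."""
    return [w for w in text.lower().split() if w not in STOP_WORDS]


def build_inverted_index(records):
    """Map each word to the set of record ids in dataset A containing it."""
    index = defaultdict(set)
    for rid, text in enumerate(records):
        for word in tokenize(text):
            index[word].add(rid)
    return index


def pick_keyword(text, index, n_records):
    """Pick the word of a B record that is most important in A.

    Assumption: importance is an IDF-style score log(N / df), so rarer
    words in A score higher. Words absent from A are skipped because
    they index no candidate records.
    """
    best, best_score = None, -1.0
    for word in tokenize(text):
        df = len(index.get(word, ()))
        if df == 0:
            continue
        score = math.log(n_records / df)
        if score > best_score:
            best, best_score = word, score
    return best


def jaccard(a, b):
    """Stand-in similarity (token Jaccard), NOT Sentence-BERT."""
    sa, sb = set(tokenize(a)), set(tokenize(b))
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0


def match_records(records_a, records_b, similarity=jaccard, threshold=0.5):
    """Return (b_id, a_id) pairs whose similarity exceeds the threshold."""
    index = build_inverted_index(records_a)
    matches = []
    for b_id, b_text in enumerate(records_b):
        keyword = pick_keyword(b_text, index, len(records_a))
        if keyword is None:
            continue
        # Only records of A listed under the keyword are compared.
        for a_id in sorted(index[keyword]):
            if similarity(b_text, records_a[a_id]) > threshold:
                matches.append((b_id, a_id))
    return matches


# Hypothetical sample data for illustration only.
records_a = [
    "Apple iPhone 13 128GB blue",
    "Samsung Galaxy S21 phone",
    "Apple MacBook Air laptop",
]
records_b = ["iPhone 13 Apple 128GB", "Dell XPS laptop"]
print(match_records(records_a, records_b))  # [(0, 0)]
```

The point of the design is that each record in B is scored only against the records of A that share its keyword, rather than against all of A, which is where the efficiency gain claimed in the abstract would come from.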
Authors
Zong Wei (宗威)
Lin Songtao (林松涛)
Liu Jichang (刘继昶)
School of Economics and Management, Xidian University, Xi'an 710126
Source
Library and Information Service (《图书情报工作》)
CSSCI
Peking University Core Journal (北大核心)
2022, No. 14, pp. 128-136 (9 pages)
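Funding
This work is one of the research outcomes of the National Natural Science Foundation of China Young Scientists Fund project "Research on Consistency Issues in Multi-Source Heterogeneous Data Integration for Quality-Cost Trade-off Analysis" (Grant No. 72001164) and the Shaanxi Innovation Capability Support Program project "Research on the Construction of a Quality Evaluation System and Assurance Strategies for Scientific Data in Shaanxi Universities" (Grant No. 2022KRM130).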