Abstract
[Objective] This paper reviews the classical methods of entity resolution and the logic behind them. [Coverage] Google Scholar and CNKI were searched with the keywords "Entity Resolution", "Collective Analysis", "Crowdsourced", "Active Learning", "Privacy-Preserving", and the Chinese term "实体解析" (entity resolution). After topic screening, close reading, and backward citation tracing, a total of 86 representative publications on entity resolution were selected. [Methods] For each entity resolution method, the paper summarizes its basic idea and illustrates the resolution process with diagrams, then analyzes the key strategies, algorithms, and techniques that existing studies adopt when implementing the method. [Results] Entity resolution is a basic operation of data quality management and a key step in discovering the value of data. [Limitations] The evaluation metrics and applications of each entity resolution method are not analyzed in depth. [Conclusions] Although existing entity resolution methods can meet the needs of most applications to some extent, in big data environments they still face challenges arising from data heterogeneity, privacy protection, and distributed environments.
Author
Gao Guangshang (高广尚), Business School, Guilin University of Technology, Guilin 541004, China
Source
《数据分析与知识发现》
CSSCI
CSCD
Peking University Core Journals (北大核心)
2019, Issue 5, pp. 27-40 (14 pages)
Data Analysis and Knowledge Discovery
Funding
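This work is one of the research outcomes of the National Natural Science Foundation of China project "Research on Incremental Entity Resolution Methods for Data Evolution" (Grant No. 71761008) and the Guangxi Universities Key Research Base of Humanities and Social Sciences project "Research on Data Quality Improvement for Enterprise Data Governance" (Grant No. 16YB010).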
Keywords
Entity Resolution
Collective Analysis
Crowdsourced
Active Learning
Privacy-Preserving