Abstract
As Internet technologies develop and spread, the volume of Web data keeps growing. Entity resolution is a key step in Web data integration: its task is to identify records from different Web data sources that refer to the same real-world entity. Because these records come from different sources, they suffer from problems such as duplication. Domain-specific entity resolution is usually handled with a customized fuzzy join. However, current state-of-the-art fuzzy join techniques, such as prefix filtering, do not support customized transformation rules and show poor performance and scalability. To address domain-specific data duplication and improve the scalability of entity resolution, this paper introduces a fuzzy join technique based on user-defined transformation rules. Signatures are generated with a scheme based on locality-sensitive hashing (LSH); compared with prefix filtering, LSH signatures are more representative, and signatures that occur with high frequency can be pruned, which significantly reduces the number of comparisons required. Finally, set similarity is used to decide whether two records are duplicates, and the effectiveness of the algorithm is verified on a real-world data set from the real estate domain.
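The pipeline outlined in the abstract (transformation rules, LSH signatures, pruning of high-frequency signatures, set-similarity verification) might look roughly like the Python sketch below. It is an illustration under stated assumptions, not the paper's implementation: the rule table `RULES`, the MinHash banding scheme, the `max_bucket` pruning cutoff, the Jaccard threshold, and all function names are hypothetical choices made for this example.

```python
import hashlib
from collections import defaultdict
from itertools import combinations

# Hypothetical domain rules (not from the paper): abbreviation -> canonical form.
RULES = {"rd": "road", "st": "street", "apt": "apartment", "bldg": "building"}


def tokens(record, rules=RULES):
    """Lowercase, split on whitespace/commas, and apply the transformation rules."""
    words = record.lower().replace(",", " ").split()
    return frozenset(rules.get(w, w) for w in words)


def _hash(seed, token):
    """Deterministic 64-bit hash of a token under a given seed."""
    digest = hashlib.md5(f"{seed}:{token}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def minhash(token_set, num_hashes):
    """MinHash vector: the minimum hash value of the set under each seed."""
    return tuple(min(_hash(i, t) for t in token_set) for i in range(num_hashes))


def lsh_signatures(token_set, bands=8, rows=2):
    """Split the MinHash vector into bands; each (band index, slice) is one signature."""
    mh = minhash(token_set, bands * rows)
    return [(b, mh[b * rows:(b + 1) * rows]) for b in range(bands)]


def jaccard(a, b):
    """Jaccard similarity of two token sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def fuzzy_join(records, threshold=0.6, max_bucket=50, bands=8, rows=2):
    """Return index pairs (i, j), i < j, whose token sets have Jaccard >= threshold."""
    sets = [tokens(r) for r in records]
    buckets = defaultdict(list)
    for idx, s in enumerate(sets):
        if not s:  # skip empty records; MinHash is undefined for them
            continue
        for sig in lsh_signatures(s, bands, rows):
            buckets[sig].append(idx)

    candidates = set()
    for ids in buckets.values():
        if len(ids) > max_bucket:  # assumed pruning rule: drop high-frequency signatures
            continue
        candidates.update(combinations(ids, 2))

    # Verify each surviving candidate pair with set similarity.
    return sorted(p for p in candidates if jaccard(sets[p[0]], sets[p[1]]) >= threshold)


if __name__ == "__main__":
    listings = ["12 Main St Apt 4", "12 main street apartment 4", "99 Ocean Rd"]
    print(fuzzy_join(listings))
```

In this sketch, the first two listings normalize to the same token set and are reported as duplicates, while the third shares no tokens with them. Dropping oversized buckets trades some recall for fewer comparisons; how the paper balances that trade-off is not stated in the abstract.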
Source
《数据挖掘》
2022, Issue 3, pp. 280-296 (17 pages)
Hans Journal of Data Mining