Abstract
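Approximately duplicate records are one of the key problems affecting data quality in data integration systems. To improve detection precision and efficiency, this paper combines several existing traditional methods and improves on them: (1) When comparing fields, strings are compared character by character according to the situation at hand, so the algorithm adapts to different language environments and is broadly applicable. (2) When comparing records, different fields are assigned different weights, and a vector distance algorithm based on the vector space model (VSM) is used, which improves the precision of approximately duplicate record detection. (3) During clustering, a priority queue strategy is used, which reduces the number of comparisons between records and improves detection efficiency. Theoretical analysis and experiments show that the proposed method for detecting approximately duplicate records is effective.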
Approximately duplicate records are one of the key problems affecting data quality in data integration. This article presents a combined approach for detecting approximately duplicate records. It has three distinctive features: (1) To compare the similarity of two fields, a general-purpose string comparison algorithm is proposed that can handle a multi-language environment. (2) To improve detection precision, each field of a record is assigned an appropriate weight and a VSM-based algorithm is adopted. (3) An algorithm based on a priority queue is proposed. It scans all sorted records sequentially and clusters approximately duplicate records together by comparing the similarity between the current record and the records in the priority queue, which improves detection efficiency. The effectiveness of the proposed approach is verified through analysis and experiments.
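The three features can be illustrated with a minimal Python sketch. The abstract does not give the paper's formulas, so everything here is an assumption made for illustration: `field_similarity` uses Levenshtein edit distance as the character-level comparison, `record_similarity` uses a weighted average of field similarities as a stand-in for the VSM-based vector distance, and `cluster_duplicates` follows a common priority-queue scheme in which each record is compared only with the representatives of a few recently touched clusters. The function names, the `threshold` and `queue_size` parameters, and their default values are hypothetical.

```python
def field_similarity(a: str, b: str) -> float:
    """Character-level similarity in [0, 1] from Levenshtein edit distance.

    Python strings are sequences of Unicode code points, so the same
    comparison applies to Chinese, English, or mixed text.
    """
    if not a and not b:
        return 1.0
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return 1.0 - prev[-1] / max(len(a), len(b))


def record_similarity(r1, r2, weights) -> float:
    """Weighted average of per-field similarities (assumed stand-in for VSM)."""
    total = sum(weights)
    return sum(
        w * field_similarity(x, y) for w, x, y in zip(weights, r1, r2)
    ) / total


def cluster_duplicates(records, weights, threshold=0.8, queue_size=4):
    """Scan sorted records once, comparing each one only with the
    representatives (first members) of the clusters held in the queue."""
    queue = []      # clusters; index 0 has the highest priority
    clusters = []   # every cluster ever created, in creation order
    for rec in sorted(records):
        match = next(
            (c for c in queue
             if record_similarity(rec, c[0], weights) >= threshold),
            None,
        )
        if match is None:
            match = []                  # no similar cluster: start a new one
            clusters.append(match)
        else:
            queue = [c for c in queue if c is not match]
        match.append(rec)
        queue.insert(0, match)          # most recently touched goes first
        del queue[queue_size:]          # drop the lowest-priority clusters
    return clusters


if __name__ == "__main__":
    sample = [
        ("Jon Smith", "Beijing"),
        ("Alice Wong", "Shanghai"),
        ("John Smith", "Beijing"),
    ]
    for group in cluster_duplicates(sample, weights=(0.7, 0.3)):
        print(group)
```

Because each record is compared with at most `queue_size` cluster representatives instead of with every earlier record, the number of comparisons grows roughly linearly with the number of records once they are sorted, which is the efficiency gain the abstract attributes to the priority queue.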
Source
《微电子学与计算机》
CSCD
Peking University Core Journal (北大核心)
2008, Issue 8, pp. 184-187 (4 pages)
Microelectronics & Computer
Funding
Beijing Natural Science Foundation project (4072018)