Approach for Detecting Approximately Duplicate Records Based on VSM (original title: 一种基于VSM的检测相似重复记录的方法), cited by: 10
Abstract Approximately duplicate records are one of the key problems affecting data quality in data integration systems. To improve detection precision and efficiency, this article combines several existing methods and improves on them in three ways. (1) When comparing two fields, strings are compared character by character according to the case at hand, so the algorithm adapts to different language environments and is general-purpose. (2) When comparing two records, each field is assigned its own weight and a vector distance algorithm based on the vector space model (VSM) is applied, which improves detection precision. (3) During clustering, a priority queue strategy is used: the sorted records are scanned sequentially, and each record is compared with the records in the priority queue so that approximately duplicate records cluster together. This reduces the number of record comparisons and improves detection efficiency. Theoretical analysis and experiments show that the proposed approach is effective.
Author 张昌年 (Zhang Changnian)
Source Microelectronics & Computer (《微电子学与计算机》), CSCD, PKU Core (北大核心), 2008, No. 8, pp. 184-187 (4 pages)
Funding Beijing Natural Science Foundation project (4072018)
Keywords VSM (vector space model); clustering; approximately duplicate records; weight; priority queue
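
The three steps summarized in the abstract can be illustrated with a minimal Python sketch. It is not the paper's implementation, since the abstract gives no formulas. The sketch assumes a normalized Levenshtein distance for the character-level field comparison, a weighted Euclidean distance from a perfect match as the VSM-style record measure, and a small fixed-size queue of clusters in which the first record of each cluster serves as its representative. All function names, the threshold, and the queue size are hypothetical.

```python
from collections import deque
from math import sqrt


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance, computed one Unicode code point at a time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def field_similarity(a: str, b: str) -> float:
    """Similarity in [0, 1]; 1.0 means the two field values are identical."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(a, b) / longest


def record_similarity(r1, r2, weights) -> float:
    """Assumed measure: 1 minus the weighted Euclidean distance between the
    vector of field similarities and the all-ones (perfect match) vector."""
    total = sum(weights)
    squared = sum(w * (1.0 - field_similarity(x, y)) ** 2
                  for x, y, w in zip(r1, r2, weights))
    return 1.0 - sqrt(squared / total)


def cluster_records(records, weights, threshold=0.8, queue_size=4):
    """Scan sorted records once, comparing each only with queued clusters."""
    queue = deque(maxlen=queue_size)  # left end holds the highest priority
    clusters = []
    for rec in sorted(records):
        match = next((c for c in queue
                      if record_similarity(rec, c[0], weights) >= threshold),
                     None)
        if match is not None:
            match.append(rec)
            queue.remove(match)      # promote the matched cluster
            queue.appendleft(match)
        else:
            new_cluster = [rec]
            clusters.append(new_cluster)
            queue.appendleft(new_cluster)  # a full queue drops its last entry
    return clusters


records = [
    ("张昌年", "北京"),
    ("John Smith", "Boston"),
    ("Alice Wong", "Paris"),
    ("张昌年", "北京市"),
    ("Jon Smith", "Boston"),
]
clusters = cluster_records(records, weights=(0.7, 0.3), threshold=0.75)
```

Because Python strings are sequences of code points, the same comparison handles Chinese and Latin-script fields without special cases. In this sample the two near-identical pairs each end up in one cluster and the remaining record stays alone, while each record is compared against at most `queue_size` cluster representatives rather than every earlier record.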
