Abstract
As record dimensionality grows, the projection step of the classic similar duplicate record detection algorithm SNM not only loses data but also markedly raises the algorithm's error rate. To address these shortcomings, the DRR algorithm is proposed: it builds an R-tree index to preserve the high-dimensional spatial characteristics of the records, uses clustering to reduce the number of record comparisons within leaf nodes and so improve efficiency, and adopts an improved distance measure for record similarity that avoids the effects of high-dimensional data sparsity. Finally, comparisons with the SNM algorithm on real data at different dimensionalities verify the effectiveness of the algorithm.
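For readers unfamiliar with the baseline, the following is a minimal sketch of the classic sorted-neighborhood idea behind SNM: project each record onto a sorting key, sort, and compare only records that fall inside a sliding window. It is not the DRR algorithm of this paper. The function name `snm_duplicates`, the `window` and `threshold` parameters, the sample records, and the use of `difflib.SequenceMatcher` as the similarity measure are all illustrative assumptions.

```python
from difflib import SequenceMatcher


def snm_duplicates(records, key, window=3, threshold=0.85):
    """Return index pairs (i, j), i < j, of records judged similar duplicates.

    Hypothetical helper: sorts records by a one-dimensional key and compares
    each record only with the next window - 1 records in sorted order.
    """
    # Sort record indices by the key (the one-dimensional projection).
    order = sorted(range(len(records)), key=lambda i: key(records[i]))
    pairs = set()
    for pos, i in enumerate(order):
        # Slide the window: only close neighbours in sorted order are compared.
        for j in order[pos + 1 : pos + window]:
            a, b = " ".join(records[i]), " ".join(records[j])
            # Assumed similarity measure: character-level matching ratio.
            if SequenceMatcher(None, a, b).ratio() >= threshold:
                pairs.add((min(i, j), max(i, j)))
    return sorted(pairs)


# Made-up sample records: (name, address).
records = [
    ("john smith", "12 oak street"),
    ("maria garcia", "5 elm road"),
    ("john smith", "12 oak street"),
    ("smith john", "12 oak street"),
]
found = snm_duplicates(records, key=lambda r: r[0], window=2)
print(found)  # [(0, 2)]
```

In this toy run the record with the swapped name (index 3) sorts far from index 0, so with `window=2` the two are never compared. This is one way the single-key projection can hide records that are close in the full attribute space, which is the kind of weakness the abstract attributes to SNM.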
Source
Microelectronics & Computer (《微电子学与计算机》)
CSCD
Peking University Core Journal
2017, No. 9, pp. 97-102 (6 pages)
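Funding
Key Laboratory Project of Xinjiang Uygur Autonomous Region (2016D03019)
High-Tech Program Project of Xinjiang Uygur Autonomous Region (201512103)
Science and Technology Service Network Initiative (STS Program) of the Chinese Academy of Sciences (KFJ-EW-STS-129)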
Key words: SNM algorithm
R-tree index
high-dimensional space characteristics
improved distance algorithm
data sparsity