Abstract
With the widespread use of Internet applications, the massive volumes of data being generated commonly suffer from quality problems. This paper studies the inconsistency problem in data quality, and designs and implements an inconsistent-data detection and repair algorithm on the Hadoop parallel platform. Using conditional functional dependencies (CFDs) from data dependency theory, the algorithm detects inconsistent data according to a given set of rules and computes a repair for those data so that the repaired result satisfies the consistency requirements; it also reports the deterministic probability of the repair. Experiments show that the algorithm achieves better repair results than existing single-machine algorithms, and that its execution time grows linearly when the number of constraint rules is small.
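As a rough illustration of what CFD-based detection involves, the sketch below checks one CFD against a small in-memory table using a map step (filter by the left-hand-side pattern and emit a grouping key) and a reduce step (compare right-hand-side values within each group). It is a minimal single-process sketch, not the paper's Hadoop implementation: the function names, the `_` wildcard convention, and the sample rows are all assumptions made for illustration, and the repair step and its deterministic probability are not shown.

```python
from collections import defaultdict

WILDCARD = "_"  # assumed convention: "_" in a pattern matches any value

# Hypothetical sample data, for illustration only.
ROWS = [
    {"country": "44", "zip": "EH4", "city": "Edinburgh"},
    {"country": "44", "zip": "EH4", "city": "London"},
    {"country": "01", "zip": "07974", "city": "Murray Hill"},
    {"country": "01", "zip": "07974", "city": "New York"},
]


def matches(row, attrs, pattern):
    """True if the row agrees with every constant in the pattern."""
    return all(p == WILDCARD or row[a] == p for a, p in zip(attrs, pattern))


def map_phase(rows, lhs, lhs_pattern):
    """Emit (LHS values, row index) for rows that match the LHS pattern."""
    for i, row in enumerate(rows):
        if matches(row, lhs, lhs_pattern):
            yield tuple(row[a] for a in lhs), i


def reduce_phase(pairs, rows, rhs, rhs_pattern):
    """Group by LHS values and collect the indices of violating rows."""
    groups = defaultdict(list)
    for key, i in pairs:
        groups[key].append(i)

    violations = set()
    for idxs in groups.values():
        if rhs_pattern != WILDCARD:
            # Constant RHS: each row must carry exactly that constant.
            violations.update(i for i in idxs if rows[i][rhs] != rhs_pattern)
        elif len({rows[i][rhs] for i in idxs}) > 1:
            # Variable RHS: rows agreeing on the LHS must agree on the RHS.
            violations.update(idxs)
    return sorted(violations)


def detect(rows, lhs, lhs_pattern, rhs, rhs_pattern):
    """Return indices of rows that violate the CFD (lhs -> rhs, pattern)."""
    pairs = map_phase(rows, lhs, lhs_pattern)
    return reduce_phase(pairs, rows, rhs, rhs_pattern)
```

For example, `detect(ROWS, ["country", "zip"], ["44", "_"], "city", "_")` flags the two rows that share a country code and zip but disagree on the city, while a rule with a constant right-hand side flags only the row whose city differs from that constant.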
Source
《计算机科学与探索》
CSCD
Peking University Core Journal (北大核心)
2015, Issue 9, pp. 1044-1055 (12 pages)
Journal of Frontiers of Computer Science and Technology
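Funding
National Natural Science Foundation of China, No. 61472099
National Basic Research Program of China (973 Program), No. 2012CB316200
National Key Technology R&D Program of China, No. 2015BAH10F00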