Hadoop-Based Inconsistence Detection and Reparation Algorithm for Big Data
Cited by: 13
Abstract: With the widespread use of the Internet, the massive volumes of data being generated commonly suffer from quality problems. This paper studies the inconsistency problem in data quality, and designs and implements an algorithm for detecting and repairing inconsistent data on the Hadoop parallel platform. Using conditional functional dependencies (CFDs) from data dependency theory, the algorithm detects the set of inconsistent data according to given rules, computes a repair for that data so that the repaired result satisfies the consistency requirements, and reports a certainty probability for the repair result. Experiments show that the algorithm achieves better repair results than existing single-machine algorithms, and that its execution time grows linearly when the number of constraint rules is small.
Source: Journal of Frontiers of Computer Science and Technology (《计算机科学与探索》), indexed in CSCD and the Peking University Core Journals list, 2015, No. 9, pp. 1044-1055 (12 pages)
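Funding: National Natural Science Foundation of China (No. 61472099); National Basic Research Program of China (973 Program) (No. 2012CB316200); National Key Technology R&D Program of China (No. 2015BAH10F00)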
Keywords: data inconsistency; MapReduce; conditional functional dependency; data quality
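
The detection step described in the abstract can be illustrated with a minimal, single-process sketch of CFD violation detection written in a map/reduce style. This is not the paper's algorithm: the `CFD` type, the `_` wildcard convention, the helper names, and the sample records are all assumptions made for illustration, and the repair step and the certainty probability are not covered. The map phase keys each tuple that matches the rule's left-hand-side pattern by its left-hand-side values; the reduce phase flags groups whose right-hand-side values conflict with the rule.

```python
from collections import defaultdict
from typing import NamedTuple

WILDCARD = "_"  # assumed convention: "_" matches any value


class CFD(NamedTuple):
    """Hypothetical rule: lhs attributes -> rhs attribute, with a pattern tuple."""
    lhs: tuple          # attribute names on the left-hand side
    rhs: str            # attribute name on the right-hand side
    lhs_pattern: tuple  # one constant or WILDCARD per lhs attribute
    rhs_pattern: str    # a constant or WILDCARD


def map_phase(records, cfd):
    """Emit (lhs values, (row id, rhs value)) for rows matching the lhs pattern."""
    for rid, row in enumerate(records):
        if all(p == WILDCARD or row[a] == p
               for a, p in zip(cfd.lhs, cfd.lhs_pattern)):
            yield tuple(row[a] for a in cfd.lhs), (rid, row[cfd.rhs])


def reduce_phase(pairs, cfd):
    """Group by key and return the sorted ids of rows that violate the rule."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    violations = set()
    for values in groups.values():
        if cfd.rhs_pattern != WILDCARD:
            # Constant rule: the rhs value must equal the given constant.
            violations.update(rid for rid, v in values if v != cfd.rhs_pattern)
        elif len({v for _, v in values}) > 1:
            # Variable rule: rows that agree on lhs must agree on rhs.
            violations.update(rid for rid, _ in values)
    return sorted(violations)


def detect(records, cfd):
    return reduce_phase(map_phase(records, cfd), cfd)


# Made-up sample data: country code, area code, city.
records = [
    {"cc": "44", "ac": "131", "city": "EDI"},
    {"cc": "44", "ac": "131", "city": "NYC"},
    {"cc": "01", "ac": "908", "city": "MH"},
    {"cc": "01", "ac": "908", "city": "NYC"},
]
constant_rule = CFD(("cc", "ac"), "city", ("44", "131"), "EDI")
variable_rule = CFD(("cc", "ac"), "city", ("01", WILDCARD), WILDCARD)

print(detect(records, constant_rule))
print(detect(records, variable_rule))
```

On a real Hadoop cluster the grouping done by `defaultdict` here would be performed by the shuffle between mappers and reducers; the sketch keeps everything in one process so that the logic is easy to follow.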

