
Data repairing and clustering based on density and semi-supervised learning (cited by: 1)
Abstract: Existing data repairing algorithms require integrity constraints on the dataset, are not suitable for datasets with simple schemas, and make insufficient use of background knowledge. To address these problems, a repairing and clustering algorithm based on density and semi-supervised learning is proposed. Following the minimum-change principle of data repairing, temporary clusters are formed from the density information of the sample set and from background knowledge. Pairwise constraints are then used to split or merge the temporary clusters into the final clusters, and inaccurate data are repaired as part of the clustering. Experimental results show that the algorithm is suitable for sample sets with simple schemas, extends existing data repairing algorithms based on integrity constraints, and improves both data repair accuracy and clustering accuracy.
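The abstract only outlines the pipeline, so the following is a minimal Python sketch of the general idea rather than the authors' algorithm. Everything specific in it is an assumption made for illustration: the function names, the `eps` neighbourhood radius, the use of connected components as a stand-in for density clustering, the representation of pairwise constraints as must-link and cannot-link index pairs, and majority voting as the repair rule.

```python
from collections import Counter
from math import dist


def temp_clusters(points, eps):
    """Temporary clusters: connected components of the eps-neighbourhood graph."""
    labels = [-1] * len(points)
    current = 0
    for i in range(len(points)):
        if labels[i] != -1:
            continue
        labels[i] = current
        stack = [i]
        while stack:
            p = stack.pop()
            for q in range(len(points)):
                if labels[q] == -1 and dist(points[p], points[q]) <= eps:
                    labels[q] = current
                    stack.append(q)
        current += 1
    return labels


def merge_must_link(labels, must_link):
    """Merge the clusters of every (a, b) pair that must share a cluster."""
    labels = list(labels)
    for a, b in must_link:
        src, dst = labels[b], labels[a]
        if src != dst:
            labels = [dst if l == src else l for l in labels]
    return labels


def split_cannot_link(points, labels, cannot_link):
    """Split a cluster holding a cannot-link pair (a, b): members closer to b move out."""
    labels = list(labels)
    for a, b in cannot_link:
        if labels[a] != labels[b]:
            continue
        old, new = labels[a], max(labels) + 1
        for i, l in enumerate(labels):
            if l == old and dist(points[i], points[b]) < dist(points[i], points[a]):
                labels[i] = new
    return labels


def repair_by_majority(values, labels):
    """Assumed repair rule: set each value to the majority value of its cluster."""
    majority = {}
    for label in set(labels):
        members = [v for v, l in zip(values, labels) if l == label]
        majority[label] = Counter(members).most_common(1)[0][0]
    return [majority[l] for l in labels]


def repair_and_cluster(points, values, eps, must_link, cannot_link):
    """Density grouping, then constraint-driven merge and split, then repair."""
    labels = temp_clusters(points, eps)
    labels = merge_must_link(labels, must_link)
    labels = split_cannot_link(points, labels, cannot_link)
    return labels, repair_by_majority(values, labels)


# Hypothetical toy data: six records with 2-D features and one categorical value.
points = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (10, 10)]
values = ["A", "A", "B", "C", "C", "X"]
labels, repaired = repair_and_cluster(points, values, 1.5, [(3, 5)], [])
print(labels)
print(repaired)
```

In this toy run the must-link pair pulls the isolated record into a neighbouring cluster, and majority voting changes only the values that disagree with their cluster, which is one simple reading of the minimum-change principle.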
Authors: ZHANG Qian-qian (张倩倩); LI Guo-he (李国和); ZHENG Yi-feng (郑艺峰) (Beijing Key Lab of Petroleum Data Mining, China University of Petroleum (Beijing), Beijing 102249, China; College of Information Science and Engineering, China University of Petroleum (Beijing), Beijing 102249, China; PanPass Institute of Digital Identification Management and Internet of Things, Beijing 100029, China; Fujian Provincial Key Laboratory of Data Science and Intelligence Applications, Minnan Normal University, Zhangzhou 363000, China; College of Computer Science, Minnan Normal University, Zhangzhou 363000, China)
Source: Computer Engineering and Design (《计算机工程与设计》), Peking University Core Journal, 2020, No. 3, pp. 676-681 (6 pages)
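Funding: National Natural Science Foundation of China (61701213); National Key Special Project for Oil and Gas, sub-project fund (G-5800-08-ZS-WX); Research Start-up Fund of China University of Petroleum (Beijing), Karamay Campus (RCYJ2016B-03-001); Fund for Young and Middle-aged Researchers of the Fujian Provincial Department of Education (JA15300).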
Keywords: data quality; data cleaning; data repairing; pairwise constraints; density-based clustering