Abstract
Data quality is critical to management information systems. However, data quality problems arise from user misspellings, data-entry errors, system upgrades, and similar causes. The purpose of data cleaning is to detect dirty data and repair it. Because current cleaning tools lack flexibility and extensibility, this paper proposes a general-purpose data cleaning model based on rule learning and data learning. The model implements key techniques such as dynamic rule learning and dynamic data learning. Rule matching and feedback learning select the best cleaning rule dynamically; field learning and meta-table learning initialize the dynamic data. Experiments show that applying the model ensures the quality of dynamic data and improves the flexibility and extensibility of current cleaning tools.
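The abstract does not say how rule matching and feedback learning are implemented, so the following is only a minimal Python sketch of the general idea, not the authors' method. It assumes each rule carries a numeric weight that feedback raises or lowers, and that the highest-weighted matching rule is selected. The names `CleaningRule`, `select_rule`, and `feedback`, the two sample rules, and the sample value are all hypothetical.

```python
import re
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class CleaningRule:
    """Hypothetical cleaning rule: a match test, a repair, and a learned weight."""
    name: str
    matches: Callable[[str], bool]
    repair: Callable[[str], str]
    weight: float = 1.0


def select_rule(rules: List[CleaningRule], value: str) -> Optional[CleaningRule]:
    """Rule matching: among the rules that match, pick the highest-weighted one."""
    candidates = [r for r in rules if r.matches(value)]
    return max(candidates, key=lambda r: r.weight) if candidates else None


def feedback(rule: CleaningRule, accepted: bool, step: float = 0.5) -> None:
    """Rule feedback (assumed update): reward an accepted repair, penalize a rejected one."""
    rule.weight = max(0.0, rule.weight + (step if accepted else -step))


rules = [
    CleaningRule("strip_spaces", lambda v: v != v.strip(), str.strip),
    CleaningRule("digits_only", lambda v: bool(re.search(r"\D", v)),
                 lambda v: re.sub(r"\D", "", v)),
]

dirty = " 029-8553 "
first = select_rule(rules, dirty)   # both rules match with equal weight
feedback(first, accepted=False)     # a reviewer rejects the first repair
second = select_rule(rules, dirty)  # the penalized rule now ranks lower
cleaned = second.repair(dirty)
```

In this sketch a rejected repair lowers the weight of the rule that produced it, so a different matching rule is chosen the next time the same kind of dirty value appears. A real system would need a principled update rule and a source of feedback, neither of which the abstract specifies.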
Source
Journal of Shaanxi Institute of Education (《陕西教育学院学报》)
2011, No. 3, pp. 89-93 (5 pages)
Funding
Shaanxi Institute of Education Scientific Research Fund Project (10KJ040)