Abstract
Existing quality evaluation models for conflicting data sources consider only accuracy and precision, and ignore the fact that a false description and an empty value affect the quality of a data source differently. In this paper, a false description provided by a data source is defined as an active error, and the absence of a description for an entity is defined as a passive error; a data source quality model is built on these two kinds of error. The model replaces accuracy and precision with sensitivity and specificity. To handle the multi-truth problem, the descriptions that data sources give for an entity are merged in advance, and an inclusion relation between merged descriptions is defined together with a model for calculating the inclusion degree. On the basis of this model, TFDQ, a quality evaluation algorithm for conflicting data sources based on description inclusion degree, is proposed. Experiments on the commonly used Books-Authors data set show that the results of TFDQ are closer to the ground truth than those of the Vote and TruthFinder algorithms.
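The abstract does not give the formulas behind the model, so the following is only a minimal sketch of the ideas it names, not the TFDQ algorithm itself. It assumes the standard confusion-matrix definitions of sensitivity and specificity, counts a passive error as an entity for which the source gives no description, and uses a simple set-overlap ratio as a stand-in for the inclusion degree. The function names, the data layout, and the sample books and authors are all hypothetical.

```python
from typing import Dict, Optional, Set

Claims = Dict[str, Optional[Set[str]]]


def source_quality(claims: Claims,
                   truth: Dict[str, Set[str]],
                   candidates: Dict[str, Set[str]]) -> Dict[str, float]:
    """Score one source against known true values.

    claims:     entity -> values the source provides (None = no description)
    truth:      entity -> true values (assumed known for this illustration)
    candidates: entity -> every value claimed by any source
    """
    tp = fp = fn = tn = 0
    active_errors = passive_errors = 0
    for entity, true_values in truth.items():
        provided = claims.get(entity) or set()
        if not provided:
            # Assumption: an empty description counts as one passive error.
            passive_errors += 1
        universe = candidates.get(entity, set()) | true_values
        wrong = provided - true_values
        # Assumption: each false value provided counts as one active error.
        active_errors += len(wrong)
        tp += len(provided & true_values)
        fp += len(wrong)
        fn += len(true_values - provided)
        tn += len(universe - provided - true_values)
    return {
        "active_errors": active_errors,
        "passive_errors": passive_errors,
        # Standard definitions; the paper's own formulas may differ.
        "sensitivity": tp / (tp + fn) if tp + fn else 0.0,
        "specificity": tn / (tn + fp) if tn + fp else 0.0,
    }


def inclusion_degree(a: Set[str], b: Set[str]) -> float:
    """Assumed stand-in: fraction of description a's values also found in b."""
    if not a:
        return 0.0
    return len(a & b) / len(a)


# Hypothetical data in the spirit of a books-to-authors data set.
truth = {"book1": {"Alice", "Bob"}, "book2": {"Carol"}}
candidates = {"book1": {"Alice", "Bob", "Mallory"}, "book2": {"Carol", "Dave"}}
source_a: Claims = {"book1": {"Alice", "Mallory"}, "book2": None}

result = source_quality(source_a, truth, candidates)
print(result)
print(inclusion_degree({"Alice"}, {"Alice", "Bob"}))
```

Keeping the two error counts separate from the two ratios reflects the distinction the abstract draws: a source that stays silent about an entity lowers sensitivity without adding false values, whereas a source that asserts a wrong author lowers specificity.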
Source
Journal of Zhejiang University (Engineering Science)
EI
CAS
CSCD
PKU Core (北大核心)
2015, No. 2, pp. 303-308 (6 pages)
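Funding
National Natural Science Foundation of China (51475097)
National Science and Technology Support Program of the 12th Five-Year Plan (2012BAF12B14)
Guizhou Provincial Science and Technology Program (黔科合JZ字[2014]2001, 黔科合计Z字[2012]4009)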
Keywords
data integration
quality of data sources
truth finder