Abstract
The data warehouse that describes network teaching holds large volumes of data imported from many different sources, and data quality problems directly affect the results of teaching evaluation. To handle duplicate student information, this paper proposes a segmentation strategy based on data type. Combined with the edit distance algorithm, the strategy effectively detects duplicate basic student records. Experimental results indicate that the method improves both the execution efficiency and the detection precision of the algorithm.
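The abstract does not give the segmentation rules, the similarity formula, or the decision threshold, so the following is only a minimal sketch of how type-based segmentation might be combined with edit distance. The tokenizer (runs of digits, ASCII letters, or CJK characters), the positional pairing of tokens, the 0.8 threshold, and the names `segment`, `field_similarity`, and `is_duplicate` are all assumptions made for illustration, not the authors' method.

```python
import re
from itertools import zip_longest


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed with two rolling rows."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def similarity(a: str, b: str) -> float:
    """Normalise edit distance to [0, 1]; 1.0 means identical."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - edit_distance(a, b) / longest


# Assumed segmentation rule: split a field into runs of one character type.
TOKEN_RE = re.compile(r"[0-9]+|[A-Za-z]+|[\u4e00-\u9fff]+")


def segment(value: str) -> list[str]:
    """Split a field value into digit, letter, and CJK tokens."""
    return TOKEN_RE.findall(value)


def field_similarity(x: str, y: str) -> float:
    """Average similarity of tokens paired by position (assumed pairing)."""
    tx, ty = segment(x.lower()), segment(y.lower())
    if not tx and not ty:
        return 1.0
    scores = [similarity(p, q) for p, q in zip_longest(tx, ty, fillvalue="")]
    return sum(scores) / len(scores)


def is_duplicate(r1: dict, r2: dict, fields, threshold: float = 0.8) -> bool:
    """Flag two records as duplicates if mean field similarity >= threshold."""
    mean = sum(field_similarity(r1[f], r2[f]) for f in fields) / len(fields)
    return mean >= threshold


if __name__ == "__main__":
    # Hypothetical student records, invented for this example.
    a = {"name": "Zhang San", "student_id": "20090312"}
    b = {"name": "Zhang Sam", "student_id": "20090312"}
    print(is_duplicate(a, b, ["name", "student_id"]))
```

Comparing short tokens rather than whole field values keeps each edit distance computation small, which is one plausible reading of the efficiency gain the abstract reports.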
Source
Network & Information (《网络与信息》)
2011, No. 8, pp. 40-41 (2 pages)
Funding
Liaoning Province Eleventh Five-Year Plan project
Project No.: JG 10DB192
Keywords
Approximately duplicate records
Word segmentation
Edit distance algorithm