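Abstract
This paper applies semantic labeling of sequential data to the schema matching problem for heterogeneous data sources, integrating heterogeneous Web objects extracted from multiple websites into a relational database. Building on the linear-chain conditional random field, it proposes a combined conditional random field model onto which chains of multiple orders can be stacked. The model can be trained on a joint sample set consisting of manually labeled data and relational database records, which reduces the dependence on tedious manual labeling. In addition, stacking higher-order chains on the linear-chain conditional random field lets the model handle long-distance dependencies between state variables effectively. Experiments and analysis on real-world datasets from multiple domains show that the proposed method significantly improves field labeling performance on heterogeneous Web data.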
This paper studies the problem of integrating heterogeneous semi-structured Web objects into a relational database. A generalized sequential learning model, called Combined Conditional Random Fields, is presented for solving the schema matching problem between pairs of heterogeneous Web data sources. The proposed model can be trained on both manually labeled training data and unlabeled database records, thereby reducing the dependence on tediously labeled samples. It also provides a novel way to incorporate the two-dimensional neighborhood dependencies between Web data elements. Moreover, a constrained Viterbi algorithm is implemented to perform inference under imposed labels for optimal data integration. Experimental results on a large number of Web pages from diverse domains show that the proposed method significantly improves matching accuracy.
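The abstract does not spell out the constrained Viterbi algorithm, so the following is only a minimal sketch of the general idea, assuming a first-order chain with log-scores and assuming that "imposed labels" means certain positions are forced to a given label. The function name `constrained_viterbi`, the score tables, and the `constraints` dictionary are all hypothetical and are not taken from the paper.

```python
import math


def constrained_viterbi(emission, transition, constraints=None):
    """Return the highest-scoring label sequence of a first-order chain.

    emission[t][y]   : log-score of label y at position t (hypothetical input)
    transition[a][b] : log-score of moving from label a to label b
    constraints      : optional {position: required_label}; an assumed way
                       of representing imposed labels
    """
    constraints = constraints or {}
    n_labels = len(transition)

    def allowed(t, y):
        # A label is allowed unless the position is pinned to another label.
        return t not in constraints or constraints[t] == y

    # score[y] = best log-score of any path ending in label y at position t
    score = [emission[0][y] if allowed(0, y) else -math.inf
             for y in range(n_labels)]
    backptrs = []
    for t in range(1, len(emission)):
        new_score, ptr = [], []
        for y in range(n_labels):
            best_prev = max(range(n_labels),
                            key=lambda a: score[a] + transition[a][y])
            s = score[best_prev] + transition[best_prev][y] + emission[t][y]
            # Disallowed labels get -inf so no path can pass through them.
            new_score.append(s if allowed(t, y) else -math.inf)
            ptr.append(best_prev)
        score = new_score
        backptrs.append(ptr)

    # Backtrack from the best final label.
    y = max(range(n_labels), key=lambda a: score[a])
    path = [y]
    for ptr in reversed(backptrs):
        y = ptr[y]
        path.append(y)
    path.reverse()
    return path


# Toy example: two labels, label 0 preferred everywhere, flat transitions.
emission = [[0.0, -1.0], [0.0, -1.0], [0.0, -1.0]]
transition = [[0.0, 0.0], [0.0, 0.0]]
free_path = constrained_viterbi(emission, transition)
pinned_path = constrained_viterbi(emission, transition, {1: 1})
```

In this toy setting the unconstrained decoder picks label 0 at every position, while pinning position 1 to label 1 forces the path through that label and leaves the other positions free. A higher-order or two-dimensional model, as described in the abstract, would need a larger state space than this sketch covers.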
Source
《西安电子科技大学学报》
EI
CAS
CSCD
Peking University Core Journals (北大核心)
2007, No. 1, pp. 126-130, 153 (6 pages in total)
Journal of Xidian University
Funding
National Ministry and Commission Pre-research Project (41101050108)
Xidian University Doctoral Student Innovation Fund Project (05013)