Abstract
This paper proposes an algorithm for extracting data from heterogeneous Web data sources, based on a modified hidden conditional random field (HCRF) model. The modification yields a more accurate computation of the hidden variables and removes the model's heavy dependence on the choice of initial parameters. In addition, training the model does not require a large number of manually labeled samples. Experimental results show that, compared with existing methods, the proposed algorithm achieves satisfactory recall, precision, and F1 scores when extracting data from websites whose records have missing (default) attributes or multiple attributes.
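The evaluation metrics named in the abstract (recall, precision, and F1) can be illustrated with a minimal sketch. The abstract does not specify the exact matching criterion, so the code below assumes that an extraction counts as correct only when its (record, attribute, value) triple matches the gold annotation exactly. The function name `precision_recall_f1` and the toy data are hypothetical and are not taken from the paper.

```python
def precision_recall_f1(extracted, gold):
    """Compute precision, recall, and F1 over (record_id, attribute, value) triples.

    Assumption: exact-match scoring; the paper's own criterion is not stated.
    """
    extracted, gold = set(extracted), set(gold)
    true_positives = len(extracted & gold)
    precision = true_positives / len(extracted) if extracted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical toy data: record 2 has no "price" attribute (a missing attribute).
gold = {(1, "title", "A"), (1, "price", "10"), (2, "title", "B")}
extracted = {(1, "title", "A"), (1, "price", "12"), (2, "title", "B")}

p, r, f = precision_recall_f1(extracted, gold)
print(f"precision={p:.3f} recall={r:.3f} f1={f:.3f}")
```

Scoring over sets of triples means a record that lacks an attribute simply contributes fewer gold items, so missing attributes need no special handling in this sketch.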
Source
《科技通报》 (Bulletin of Science and Technology)
Indexed in: 北大核心 (Peking University Core Journals)
2012, No. 8, pp. 168-170 (3 pages)
Keywords
conditional random fields
hidden conditional random fields
Web data extraction
discriminative model