Abstract
To retrieve information hidden on the Internet, a web information extraction method based on conditional random fields (CRF) is proposed. The method labels each line of the sample web pages, determines the text features, builds a CRF model, and trains it on the samples with a quasi-Newton iterative method. Web search results are then extracted according to the learned conditional probability model. Unlike the hidden Markov model (HMM), the CRF can incorporate linguistic features of the web text, which yields higher extraction precision. Experimental results show that extraction precision with the CRF exceeds 90%, higher than that obtained with the HMM.
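The line-labeling model described in the abstract can be illustrated with a minimal Python sketch of a linear-chain CRF over the lines of a page. Everything specific here is an assumption made for illustration and is not taken from the paper: the label set (`title`, `snippet`, `url`), the three binary line features, and the hand-set weights in `STATE_W` and `TRANS_W`. In the paper's method the weights would be learned by quasi-Newton training, which is not shown. The sketch also normalizes by brute-force enumeration of all label sequences, which is only feasible for a handful of lines; a practical implementation would use the forward algorithm instead.

```python
import itertools
import math
import re

# Hypothetical label set for lines of a search-result page.
LABELS = ("title", "snippet", "url")


def line_features(line):
    """Binary features of one line of page text (illustrative only)."""
    return {
        "has_url": bool(re.search(r"https?://|www\.", line)),
        "is_short": len(line) < 40,
        "ends_with_period": line.rstrip().endswith("."),
    }


# Hypothetical weights, set by hand. A trained model would learn these.
STATE_W = {
    ("has_url", "url"): 3.0,
    ("is_short", "title"): 1.5,
    ("ends_with_period", "snippet"): 2.0,
}
TRANS_W = {
    ("title", "snippet"): 1.0,
    ("snippet", "url"): 1.0,
}


def score(lines, labels):
    """Unnormalized log-score: sum of active state and transition weights."""
    total = 0.0
    prev = None
    for line, label in zip(lines, labels):
        for name, active in line_features(line).items():
            if active:
                total += STATE_W.get((name, label), 0.0)
        if prev is not None:
            total += TRANS_W.get((prev, label), 0.0)
        prev = label
    return total


def conditional_distribution(lines):
    """p(labels | lines) for every label sequence, by brute-force enumeration."""
    sequences = list(itertools.product(LABELS, repeat=len(lines)))
    scores = [score(lines, seq) for seq in sequences]
    top = max(scores)  # subtract the maximum for numerical stability
    weights = [math.exp(s - top) for s in scores]
    z = sum(weights)
    return {seq: w / z for seq, w in zip(sequences, weights)}


def best_labeling(lines):
    """Most probable label sequence under the toy model."""
    dist = conditional_distribution(lines)
    return max(dist, key=dist.get)


page = [
    "Conditional random fields",
    "A CRF is a discriminative model for labeling sequences of observations.",
    "http://www.example.com/crf",
]
print(best_labeling(page))  # ('title', 'snippet', 'url')
```

Because the features are arbitrary functions of the observed line, adding a new cue (for example, a check for a date pattern) only means adding one more entry to `line_features` and a weight for it. This is the kind of flexibility the abstract contrasts with the HMM.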
Source
Journal of Liaoning Technical University (Natural Science)
EI
CAS
PKU Core (Peking University Core Journals)
2007, Issue 4, pp. 570-572 (3 pages)
Funding
Supported by the Tianjin Science and Technology Development Program Fund (07JCZDJC067007)
Keywords
conditional random fields
information extraction
web documents
quasi-Newton method