
Web information extraction based on conditional random fields (cited by: 2)
Abstract: To obtain information hidden in the Internet, a Web information extraction method based on conditional random fields (CRF) is presented. The method labels each line of the sample web documents, determines the text features, builds a CRF model, and trains it on the samples with a quasi-Newton iterative method. Web search results are then extracted according to the learned conditional probability model. Compared with the hidden Markov model (HMM), CRF supports the use of language features of the web text, so it achieves higher extraction precision. Experimental results show that extraction precision with CRF exceeds 90%, higher than that obtained with HMM.
Source: Journal of Liaoning Technical University (Natural Science) (《辽宁工程技术大学学报(自然科学版)》), indexed in EI, CAS, and the Peking University Core Journals list, 2007, No. 4, pp. 570-572 (3 pages)
Funding: Supported by the Tianjin Science and Technology Development Program (07JCZDJC067007)
Keywords: conditional random fields; information extraction; Web documents; quasi-Newton method
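
The training step named in the abstract can be made concrete with the standard linear-chain CRF formulation (Lafferty et al., 2001). The sketch below is an assumption about the general form only: the abstract does not give the paper's feature functions, label set, or any regularization term, so the symbols $f_k$, $\lambda_k$, and the indexing are illustrative. Here $x$ is the sequence of lines in one web page, $y_t$ is the label of line $t$, and $i$ ranges over the labeled sample pages.

```latex
% Conditional probability of a label sequence y given the lines x of a page
p_\lambda(y \mid x) = \frac{1}{Z_\lambda(x)}
  \exp\Big( \sum_{t=1}^{T} \sum_{k} \lambda_k \, f_k(y_{t-1}, y_t, x, t) \Big),
\qquad
Z_\lambda(x) = \sum_{y'} \exp\Big( \sum_{t=1}^{T} \sum_{k} \lambda_k \, f_k(y'_{t-1}, y'_t, x, t) \Big)

% Log-likelihood over the labeled sample pages (x^{(i)}, y^{(i)})
L(\lambda) = \sum_{i} \log p_\lambda\big(y^{(i)} \mid x^{(i)}\big)

% Gradient: empirical feature count minus expected feature count under the model
\frac{\partial L}{\partial \lambda_k}
  = \sum_{i} \sum_{t=1}^{T}
    \Big[ f_k\big(y^{(i)}_{t-1}, y^{(i)}_t, x^{(i)}, t\big)
    - \sum_{u, v} p_\lambda\big(y_{t-1} = u,\, y_t = v \mid x^{(i)}\big)\,
      f_k\big(u, v, x^{(i)}, t\big) \Big]
```

A quasi-Newton method maximizes $L(\lambda)$ using only this gradient and an iteratively updated approximation of the inverse Hessian. Extraction then assigns each page the label sequence $\arg\max_y p_\lambda(y \mid x)$, typically computed with the Viterbi algorithm.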

References (7)

  • 1 史庆伟, 赵政, 朝柯. A hierarchical clustering method for Chinese web pages based on suffix trees [J]. Journal of Liaoning Technical University (Natural Science), 2006, 25(6): 890-892. Cited by: 11
  • 2 Rabiner L R. A tutorial on hidden Markov models and selected applications in speech recognition [J]. Proceedings of the IEEE, 1989, 77(2): 257-286. Cited by: 1
  • 3 Freitag D, McCallum A. Information extraction with HMM structures learned by stochastic optimization [C]// Proceedings of the Eighteenth Conference on Artificial Intelligence. Edmonton: AAAI Press, 2002: 584-589. Cited by: 1
  • 4 Seymore K, McCallum A, Rosenfeld R. Learning hidden Markov model structure for information extraction [C]// Proceedings of the AAAI'99 Workshop on Machine Learning for Information Extraction. Orlando: AAAI Press, 1999: 37-42. Cited by: 1
  • 5 McCallum A, Freitag D, Pereira F. Maximum entropy Markov models for information extraction and segmentation [C]// Proceedings of ICML. San Francisco: Morgan Kaufmann, 2000: 591-598. Cited by: 1
  • 6 Lafferty J, McCallum A, Pereira F. Conditional random fields: Probabilistic models for segmenting and labeling sequence data [C]// Proceedings of ICML. San Francisco: Morgan Kaufmann, 2001: 282-289. Cited by: 1
  • 7 袁亚湘. Numerical Methods for Nonlinear Programming [M]. Shanghai: Shanghai Scientific and Technical Publishers, 1993. Cited by: 1


