基于SVM和扩展条件随机场的Web实体活动抽取 (cited by: 14)

Extracting Web Entity Activities Based on SVM and Extended Conditional Random Fields
Abstract: Building on traditional information extraction, this paper studies the extraction of Web entity activities. It gives a formal definition of entity activity based on case grammar and proposes an extraction method based on a support vector machine (SVM) and extended conditional random fields that accurately extracts entity activity information from the Web. First, to avoid the heavy work of manually labeling training data, a heuristic rule-based algorithm generates the training data by converting semantic role labeling training sets into training sets suited to Web entity activity extraction; these are used to train the SVM classifier and the extended conditional random fields, respectively. During extraction, the classifier first identifies the sentences that contain entity activities. The extended conditional random fields then model label frequency features and relationship features, which traditional conditional random fields cannot exploit, and label the information to be extracted in natural language sentences, improving labeling accuracy. Experiments in multiple domains show that the proposed method is well suited to Web entity activity extraction.
Source: Journal of Software (《软件学报》), indexed in EI, CSCD, and the Peking University Core Journals list; 2012, Issue 10, pp. 2612-2627 (16 pages)
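Funding: National Natural Science Foundation of China (61003051); National Key Technology R&D Program of China (2009BAH44B02); Natural Science Foundation of Shandong Province (2009ZRB019RW); Key Science and Technology Program of Shandong Province (2010GGX10108)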
Keywords: information extraction; case grammar; entity activity; support vector machine; extended conditional random fields
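
The training-data generation step described in the abstract can be illustrated with a minimal sketch. The abstract does not give the paper's actual heuristic rules or label set, so everything below is an assumption made for illustration: the PropBank-style input tags (`A0`, `V`, `A1`, `AM-TMP`, `AM-LOC`), the activity labels (`AGENT`, `ACTION`, `OBJECT`, `TIME`, `LOCATION`), the rule that a sentence counts as containing an activity when it has both an agent and an action, and the helper names `convert_sentence`, `contains_activity`, and `build_training_data`.

```python
from typing import List, Tuple

# Hypothetical rule table: PropBank-style SRL tag -> activity label.
# The paper's real rules and label inventory are not stated in the abstract.
SRL_TO_ACTIVITY = {
    "A0": "AGENT",
    "V": "ACTION",
    "A1": "OBJECT",
    "AM-TMP": "TIME",
    "AM-LOC": "LOCATION",
}

Token = Tuple[str, str]  # (word, tag)


def convert_sentence(srl_tokens: List[Token]) -> List[Token]:
    """Relabel one SRL-annotated sentence; tags without a rule become 'O'."""
    return [(word, SRL_TO_ACTIVITY.get(tag, "O")) for word, tag in srl_tokens]


def contains_activity(tokens: List[Token]) -> bool:
    """Assumed heuristic: an activity needs at least an agent and an action."""
    labels = {label for _, label in tokens}
    return {"AGENT", "ACTION"} <= labels


def build_training_data(srl_corpus: List[List[Token]]):
    """Derive two training sets from one SRL corpus.

    Returns (classifier_examples, labeler_examples):
      classifier_examples: (words, has_activity) pairs for a sentence classifier.
      labeler_examples: relabeled token sequences, positive sentences only,
        for a sequence labeler.
    """
    classifier_examples = []
    labeler_examples = []
    for sentence in srl_corpus:
        converted = convert_sentence(sentence)
        positive = contains_activity(converted)
        classifier_examples.append(([word for word, _ in converted], positive))
        if positive:
            labeler_examples.append(converted)
    return classifier_examples, labeler_examples
```

In this sketch the first returned list would feed a sentence classifier (an SVM in the paper) and the second a sequence labeler (the extended conditional random fields in the paper); neither model is implemented here.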
