Information Integration Based on Structural Analysis and Entity Recognition (cited by 5)
Abstract: To handle the massive volume of Web data, this paper proposes a Web information extraction and integration method based on document structure analysis and entity recognition. It uses the expressive data-description capability of XML to organize the integrated Web document content flexibly. The method first converts semi-structured HTML documents into XML documents that conform to a schema. It then applies entity recognition to the different topic regions to extract well-formed data. Finally, it integrates the resulting information, which spans multiple data types, into a database to support further querying and analysis. The approach also defines patterns that handle the heterogeneity of Web documents and allow the integrated documents to be customized. Experimental results support the feasibility of the approach.
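Source: Journal of Computer Research and Development (《计算机研究与发展》), indexed in EI, CSCD, and the Peking University Core Journals list; 2004, No. 10, pp. 1823-1828 (6 pages)
Funding: National "973" Key Basic Research Development Program (G1999032705); National "863" High-Tech Research and Development Program, major special project on database management systems and their applications (2002AA4Z3440)
Keywords: information extraction; information integration; XML; wrapper; entity recognition (entity extraction)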
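
The three steps named in the abstract (HTML to XML conversion, entity recognition over content regions, integration into a database) can be illustrated with a minimal Python sketch. This is not the authors' implementation. The class and function names, the choice of `td` as the content region, the two regular expressions standing in for entity recognition, and the single `entity` table are all assumptions made for illustration.

```python
import re
import sqlite3
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# HTML elements that never have a closing tag (partial list, for illustration).
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}


class HtmlToXml(HTMLParser):
    """Step 1: build a well-formed XML tree from loosely structured HTML."""

    def __init__(self):
        super().__init__()
        self.root = ET.Element("document")
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = ET.SubElement(self.stack[-1], tag, {k: v or "" for k, v in attrs})
        if tag not in VOID_TAGS:
            self.stack.append(node)

    def handle_endtag(self, tag):
        # Close back to the nearest matching open tag; ignore stray closes.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].tag == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        node = self.stack[-1]
        if len(node):  # text that follows a child element goes in its tail
            node[-1].tail = (node[-1].tail or "") + text
        else:
            node.text = (node.text or "") + text


# Step 2: hypothetical regex patterns standing in for entity recognition.
PATTERNS = {
    "price": re.compile(r"\$\d+(?:\.\d{2})?"),
    "date": re.compile(r"\d{4}-\d{2}-\d{2}"),
}


def extract_entities(root, region_tag="td"):
    """Scan each region of the XML tree and return (kind, value) pairs."""
    rows = []
    for region in root.iter(region_tag):
        text = "".join(region.itertext())
        for kind, pattern in PATTERNS.items():
            rows.extend((kind, value) for value in pattern.findall(text))
    return rows


def integrate(rows, conn):
    """Step 3: store the extracted entities in a relational table."""
    conn.execute("CREATE TABLE IF NOT EXISTS entity (kind TEXT, value TEXT)")
    conn.executemany("INSERT INTO entity VALUES (?, ?)", rows)
    conn.commit()


def run_pipeline(html, conn):
    parser = HtmlToXml()
    parser.feed(html)
    parser.close()
    rows = extract_entities(parser.root)
    integrate(rows, conn)
    return rows


if __name__ == "__main__":
    page = "<table><tr><td>Widget<br>$19.99</td><td>2004-10-01</td></tr></table>"
    connection = sqlite3.connect(":memory:")
    print(run_pipeline(page, connection))
```

A real system of the kind the abstract describes would presumably use a richer schema and learned or rule-based recognizers in place of the two regular expressions. The sketch only shows how the stages hand data to one another.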