
Applying Clustering Techniques to Classify and Extract Web Pages (应用聚类技术分类提取Web页面)

Published English title: Application of Clustering Technology Category Extraction Web Pages
Abstract: Data-intensive dynamic Web pages contain little text and are highly structured. To exploit these characteristics, this paper presents a Web information extraction method based on HTML structure. The method first parses de-noised Web pages into DOM trees, then computes the similarity between pages from their tree edit distance and clusters the pages accordingly. Finally, it generates extraction rules for each cluster and applies them to extract data from the Web pages.
Authors: CUI Hui-chao (崔慧超), LIU Li (刘莉) (College of Information Science and Technology, Southwest Jiaotong University, Chengdu 610031, China)
Source: Computer Knowledge and Technology (《电脑知识与技术》), 2010, Issue 1, pp. 212-213 (2 pages)
Keywords: Web information extraction; tree edit distance; clustering; extraction rules
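
The first three steps named in the abstract (parsing pages into trees, measuring similarity by tree edit distance, and clustering) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration, not the authors' implementation: the distance is a simplified top-down variant in which whole subtrees are inserted or deleted, the normalization by combined tree size and the 0.7 threshold are arbitrary, the clustering is a simple leader-style pass, and all names (`Node`, `TreeBuilder`, `tree_distance`, `similarity`, `cluster`, `PAGES`) are hypothetical. Noise removal and extraction-rule generation are not covered.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag, so they must not stay on the open-element stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}

class Node:
    def __init__(self, tag):
        self.tag = tag
        self.children = []

class TreeBuilder(HTMLParser):
    """Builds a tag-only tree; text is ignored because only structure is compared."""
    def __init__(self):
        super().__init__()
        self.root = Node("#root")
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = Node(tag)
        self.stack[-1].children.append(node)
        if tag not in VOID_TAGS:
            self.stack.append(node)

    def handle_endtag(self, tag):
        # Close the nearest matching open element; stray end tags are ignored.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].tag == tag:
                del self.stack[i:]
                break

def parse(html):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root

def size(node):
    return 1 + sum(size(child) for child in node.children)

def tree_distance(a, b):
    """Simplified top-down edit distance (assumption): relabeling a node costs 1,
    inserting or deleting a child costs the size of its whole subtree."""
    relabel = 0 if a.tag == b.tag else 1
    m, n = len(a.children), len(b.children)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = d[i - 1][0] + size(a.children[i - 1])
    for j in range(1, n + 1):
        d[0][j] = d[0][j - 1] + size(b.children[j - 1])
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d[i][j] = min(
                d[i - 1][j] + size(a.children[i - 1]),   # delete subtree from a
                d[i][j - 1] + size(b.children[j - 1]),   # insert subtree from b
                d[i - 1][j - 1] + tree_distance(a.children[i - 1], b.children[j - 1]),
            )
    return relabel + d[m][n]

def similarity(a, b):
    # Normalized to (0, 1]; identical trees give 1.0.
    return 1.0 - tree_distance(a, b) / (size(a) + size(b))

def cluster(trees, threshold=0.7):
    """Leader-style clustering (assumption): a page joins the first cluster whose
    first member is similar enough, otherwise it starts a new cluster."""
    clusters = []
    for i, tree in enumerate(trees):
        for members in clusters:
            if similarity(trees[members[0]], tree) >= threshold:
                members.append(i)
                break
        else:
            clusters.append([i])
    return clusters

# Hypothetical sample pages: two table-based pages and one list-based page.
PAGES = [
    "<html><body><table><tr><td>A</td><td>1</td></tr></table></body></html>",
    "<html><body><table><tr><td>B</td><td>2</td></tr>"
    "<tr><td>C</td><td>3</td></tr></table></body></html>",
    "<html><body><div><p>x</p><p>y</p><ul><li>z</li></ul></div></body></html>",
]
print(cluster([parse(p) for p in PAGES]))  # prints [[0, 1], [2]]
```

The two table-based pages differ only by one extra row (distance 3), so they fall into the same cluster, while the list-based page starts its own. In a fuller pipeline, each cluster would then be used to derive a shared extraction rule for its pages.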

