Applying Clustering Techniques to Classify and Extract Web Pages


Abstract: Data-intensive dynamic Web pages contain little text and are highly structured. To exploit these characteristics, this paper presents a Web information extraction method based on HTML structure. The method first parses de-noised Web pages into DOM trees, then computes the similarity between pages from their tree edit distance and clusters the pages accordingly. It then generates extraction rules for each cluster and applies them to extract data from the Web pages.
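The similarity and clustering steps described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not say which tree edit distance variant or clustering algorithm the paper uses, so the sketch assumes a Selkow-style top-down distance with unit costs and a simple single-pass threshold ("leader") clustering. The names `Node`, `TreeBuilder`, `top_down_distance`, `similarity`, and `cluster`, the normalisation by combined tree size, and the threshold of 0.7 are all hypothetical choices made for illustration. De-noising and extraction-rule generation are not covered.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not be pushed on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class Node:
    def __init__(self, tag):
        self.tag = tag
        self.children = []


class TreeBuilder(HTMLParser):
    """Builds a tag-only tree; text and attributes are ignored (an assumption)."""

    def __init__(self):
        super().__init__()
        self.root = Node("#root")
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = Node(tag)
        self.stack[-1].children.append(node)
        if tag not in VOID_TAGS:
            self.stack.append(node)

    def handle_endtag(self, tag):
        # Pop back to the nearest matching open tag; ignore stray end tags.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].tag == tag:
                del self.stack[i:]
                break


def parse(html):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


def size(node):
    return 1 + sum(size(child) for child in node.children)


def top_down_distance(a, b):
    """Selkow-style distance: relabel a node, or insert/delete a whole subtree."""
    relabel = 0 if a.tag == b.tag else 1
    m, n = len(a.children), len(b.children)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = d[i - 1][0] + size(a.children[i - 1])
    for j in range(1, n + 1):
        d[0][j] = d[0][j - 1] + size(b.children[j - 1])
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d[i][j] = min(
                d[i - 1][j] + size(a.children[i - 1]),   # delete subtree of a
                d[i][j - 1] + size(b.children[j - 1]),   # insert subtree of b
                d[i - 1][j - 1]
                + top_down_distance(a.children[i - 1], b.children[j - 1]),
            )
    return relabel + d[m][n]


def similarity(a, b):
    # Normalise by the combined size so the score stays within [0, 1].
    return 1.0 - top_down_distance(a, b) / (size(a) + size(b))


def cluster(trees, threshold=0.7):
    """Single-pass clustering: join the first cluster whose leader is similar enough."""
    clusters = []  # lists of page indices; the first index is the leader
    for idx, tree in enumerate(trees):
        for members in clusters:
            if similarity(trees[members[0]], tree) >= threshold:
                members.append(idx)
                break
        else:
            clusters.append([idx])
    return clusters


pages = [
    "<html><body><table><tr><td>A</td><td>1</td></tr></table></body></html>",
    "<html><body><table><tr><td>B</td><td>2</td></tr>"
    "<tr><td>C</td><td>3</td></tr></table></body></html>",
    "<html><body><div><p>text</p><p>more</p><ul><li>x</li></ul></div></body></html>",
]
trees = [parse(p) for p in pages]
groups = cluster(trees)
print(groups)
```

In this toy input the two table-based pages differ only by one extra row, so they fall into the same cluster, while the `div`-based page forms its own. A real system would then derive one set of extraction rules per cluster, which this sketch leaves out.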

Authors: CUI Hui-chao, LIU Li (School of Information Science and Technology, Southwest Jiaotong University, Chengdu 610031, China)

Affiliation: School of Information Science and Technology, Southwest Jiaotong University

Source: Computer Knowledge and Technology (电脑知识与技术), 2010, No. 1, pp. 212-213 (2 pages)

Keywords: Web information extraction; tree edit distance; clustering; extraction rules

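Classification: TP391 [Automation and Computer Technology: Computer Application Technology] [Automation and Computer Technology: Computer Science and Technology]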

