Web Content Refining Algorithm Based on Website Topological Information (cited by: 1)


Abstract: Based on an analysis of how irrelevant information is distributed within Web pages and the patterns it follows, this paper proposes a Web content refining algorithm that uses website topology to identify and remove irrelevant page content. The algorithm partitions the content of a Web page into five sections, then treats any section whose content is identical to that of the page's parent node in the website graph as irrelevant and removes it. Experimental results show that the algorithm achieves high precision and recall.
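The pruning step described in the abstract can be sketched roughly as follows. This is only an illustration, not the paper's implementation: the abstract does not say how sections are compared or how a page's parent is found, so the sketch assumes each page is already split into text sections, that "same content" means equality after whitespace and case normalization, and that a parent map is supplied. The names `normalize`, `refine_page`, `refine_site`, and the sample URLs are all hypothetical.

```python
from typing import Dict, List, Optional


def normalize(text: str) -> str:
    """Collapse whitespace and lowercase so trivially different copies compare equal."""
    return " ".join(text.split()).lower()


def refine_page(sections: List[str], parent_sections: Optional[List[str]]) -> List[str]:
    """Keep only the sections whose normalized text does not appear on the parent page."""
    if not parent_sections:
        # Root page or unknown parent: nothing to compare against, keep everything.
        return list(sections)
    parent_texts = {normalize(s) for s in parent_sections}
    return [s for s in sections if normalize(s) not in parent_texts]


def refine_site(pages: Dict[str, List[str]], parent_of: Dict[str, str]) -> Dict[str, List[str]]:
    """Refine every page against its parent in the site graph (assumed given as a map)."""
    refined = {}
    for url, sections in pages.items():
        parent_url = parent_of.get(url)
        parent_sections = pages.get(parent_url) if parent_url is not None else None
        refined[url] = refine_page(sections, parent_sections)
    return refined


# Hypothetical two-page site: the article page repeats the home page's menu and footer.
pages = {
    "/": ["Home | News | About", "Welcome to the site", "Copyright 2007"],
    "/news/1": ["Home  |  News | About", "Article body text", "Copyright 2007"],
}
parent_of = {"/news/1": "/"}
refined = refine_site(pages, parent_of)
```

In this toy site, the navigation bar and footer that the article page shares with its parent are dropped and only the article body remains, while the root page is left unchanged because it has no parent to compare against.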

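Author: Li Feng (李锋)

Affiliation: School of Business Administration, South China University of Technology

Source: Computer Engineering (《计算机工程》), indexed in CAS, CSCD, and the Peking University Core Journals list; 2007, Issue 21, pp. 50-51, 54 (3 pages)

Funding: National Natural Science Foundation of China (70472041); Guangzhou Philosophy and Social Sciences Development "11th Five-Year" Plan Fund

Keywords: Web content refinement; information retrieval; website topology

Classification: TP391 [Automation and Computer Technology: Computer Application Technology]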

