Approach of Eliminating Web Page Noise Based on Statistical Characteristics and DOM Tree (cited by: 2)
Abstract: To address the problem of extracting the information that users are interested in from specific websites or web pages, this paper analyzes the strengths and weaknesses of existing noise-removal techniques and proposes a web page denoising method based on statistical characteristics and the DOM tree. The method first preprocesses the original page, then analyzes the page's statistical characteristics and combines them with heuristic extraction rules to remove the noise. Experiments show that the method achieves good extraction results with relatively little human intervention.
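Source: Journal of Chongqing University of Technology: Natural Science (CAS), 2011, No. 1, pp. 54-58 (5 pages)
Funding: Chongqing Science and Technology Key Project (CSTC 2010AC6074); Chongqing Jiaotong University Graduate Education Innovation Fund; Chongqing Jiaotong University Experimental Teaching Reform and Research Fund (SYJ200922)
Keywords: DOM; statistical characteristics; information retrieval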
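
The abstract describes the pipeline only at a high level: preprocess the page, compute statistical features, then apply heuristic rules over the DOM tree. The sketch below is one plausible reading of that pipeline and should not be taken as the authors' algorithm. It assumes that the statistical features are text length and link density per block, that candidate blocks are a small set of container tags, and that a block is noise when more than 30% of its text sits inside links. All names (`TreeBuilder`, `own_text`, `extract_main_text`, `max_link_density`) and the sample page are hypothetical, and the parser is Python's standard `html.parser`.

```python
from html.parser import HTMLParser

# Assumed pre-processing rule: drop everything inside these tags.
SKIP_TAGS = {"script", "style", "noscript", "iframe"}
# Void elements have no closing tag, so they are never pushed onto the tree.
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link", "area", "base",
             "col", "embed", "source", "track", "wbr"}
# Assumed set of containers that count as candidate content blocks.
BLOCK_TAGS = {"div", "td", "article", "section", "body"}


class Node:
    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []  # Node objects and text strings, in document order


class TreeBuilder(HTMLParser):
    """Builds a simplified DOM tree while skipping script/style-like content."""

    def __init__(self):
        super().__init__()
        self.root = Node("root")
        self.current = self.root
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1
            return
        if self.skip_depth or tag in VOID_TAGS:
            return
        node = Node(tag, self.current)
        self.current.children.append(node)
        self.current = node

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            self.skip_depth = max(0, self.skip_depth - 1)
            return
        if self.skip_depth or tag in VOID_TAGS:
            return
        # Close the nearest open element with this tag; ignore stray end tags.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        text = " ".join(data.split())  # collapse whitespace
        if text and not self.skip_depth:
            self.current.children.append(text)


def walk(node):
    """Yield every element node in the tree, depth first."""
    yield node
    for child in node.children:
        if isinstance(child, Node):
            yield from walk(child)


def own_text(node, in_link=False):
    """Yield (text, is_link_text) for a block, excluding nested candidate blocks."""
    in_link = in_link or node.tag == "a"
    for child in node.children:
        if isinstance(child, str):
            yield child, in_link
        elif child.tag not in BLOCK_TAGS:
            yield from own_text(child, in_link)


def extract_main_text(html, max_link_density=0.3):
    """Return the text of the block with the most non-link text (a heuristic)."""
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    best_text, best_score = "", 0
    for node in walk(builder.root):
        if node.tag not in BLOCK_TAGS:
            continue
        pieces = list(own_text(node))
        total = sum(len(text) for text, _ in pieces)
        if total == 0:
            continue
        linked = sum(len(text) for text, is_link in pieces if is_link)
        if linked / total > max_link_density:
            continue  # mostly links: treated as navigation or ad noise
        score = total - linked
        if score > best_score:
            best_text, best_score = " ".join(text for text, _ in pieces), score
    return best_text


SAMPLE = """
<html><head><title>t</title><style>p{color:red}</style></head><body>
<div id="nav"><a href="/">Home</a> <a href="/news">News</a></div>
<div id="content"><p>Web page denoising keeps the main article text.</p>
<p>It drops navigation bars, ads and scripts. <a href="/more">More</a></p></div>
<div id="footer"><a href="/help">Help</a> <a href="/top">Back to top</a></div>
<script>var x = "noise";</script>
</body></html>
"""

if __name__ == "__main__":
    print(extract_main_text(SAMPLE))
```

Scoring each block only on the text that is not inside a nested block keeps large wrappers such as `body` from winning by default. A real system would likely tune the threshold per site and return several blocks rather than one.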
