
Web Data Mining Based on Attribute Tags (cited by: 1)
Abstract: Div+CSS is widely used for Web page layout. Under this layout, many of the data records on a page are grouped at a single level of the hierarchy in the form of repeated structures. This paper proposes a method for extracting Web data based on attribute tags. It constructs a DOM tree annotated with attribute tags and mines repeated patterns by comparing the values of those attribute tags. Three rules are defined to eliminate interfering patterns and locate the data regions, and the data records are then extracted from those data regions.
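Source: Computer Applications and Software (《计算机应用与软件》), indexed in CSCD and the PKU Core Journal list, 2012, No. 11, pp. 156-159 (4 pages)
Funding: Open Project of the Shanghai Key Laboratory of Integrated Administration Technologies for Information Security (AGK2009008)
Keywords: Web security; Web data mining; HTML DOM; attribute tags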
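
The pipeline summarized in the abstract (build a DOM tree that carries attribute values, compare those values to find repeated sibling patterns, pick a data region, then read out its records) can be illustrated with a minimal Python sketch. This is not the paper's implementation. All names here (`Node`, `TreeBuilder`, `find_data_region`, `extract_records`, `min_repeats`, `PAGE`) are hypothetical, the choice of the `class` attribute as the compared "attribute tag" is an assumption, and because the abstract does not state the three filtering rules, a simple minimum-repetition threshold stands in for them.

```python
from collections import Counter
from html.parser import HTMLParser

# Elements that never have children, so the builder must not descend into them.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class Node:
    """DOM node that keeps its attributes (hypothetical helper type)."""

    def __init__(self, tag, attrs, parent=None):
        self.tag, self.attrs, self.parent = tag, dict(attrs), parent
        self.items = []  # child Nodes and text strings, in document order

    @property
    def children(self):
        return [c for c in self.items if isinstance(c, Node)]

    def signature(self):
        # Assumption: tag name plus `class` value is what siblings are compared on.
        return (self.tag, self.attrs.get("class") or "")

    def text(self):
        parts = [c.text() if isinstance(c, Node) else c for c in self.items]
        return " ".join(p for p in parts if p)


class TreeBuilder(HTMLParser):
    """Builds a Node tree from HTML using only the standard library."""

    def __init__(self):
        super().__init__()
        self.root = Node("#root", [])
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, attrs, self.current)
        self.current.items.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        # Climb to the nearest open element with this tag name, if there is one.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        if data.strip():
            self.current.items.append(data.strip())


def find_data_region(root, min_repeats=3):
    """Return (count, parent, signature) for the most repeated child pattern."""
    best = None
    stack = [root]
    while stack:
        node = stack.pop()
        kids = node.children
        stack.extend(kids)
        if not kids:
            continue
        sig, count = Counter(k.signature() for k in kids).most_common(1)[0]
        # Stand-in for the paper's three rules: require a minimum repetition count.
        if count >= min_repeats and (best is None or count > best[0]):
            best = (count, node, sig)
    return best


def extract_records(html, min_repeats=3):
    """Extract the text of each repeated child in the detected data region."""
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    region = find_data_region(builder.root, min_repeats)
    if region is None:
        return []
    _, parent, sig = region
    return [k.text() for k in parent.children if k.signature() == sig]


PAGE = """
<div id="nav"><a href="/">Home</a><a href="/about">About</a></div>
<div id="list">
  <div class="item"><span class="title">Book A</span><span class="price">10</span></div>
  <div class="item"><span class="title">Book B</span><span class="price">12</span></div>
  <div class="item"><span class="title">Book C</span><span class="price">9</span></div>
  <div class="ad">Sponsored</div>
</div>
"""

print(extract_records(PAGE))
```

On the sample page, the three `div` elements with `class="item"` share one signature under the same parent, so that parent is treated as the data region and the `ad` block is left out. Real pages would need the stricter filtering the paper describes, since navigation bars and menus also produce repeated patterns.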

