
Research on the Technology of Webpage Extraction Based on VIPS and Vague Dictionary

Cited by: 4
Abstract: In the current era of data explosion on the Internet, the direction of opinion on forums has a growing influence on society, and monitoring and guiding public opinion has become unavoidable. With data volumes this large, monitoring public-opinion information effectively is a difficult problem. Key information in forum pages, such as the title and content, is the main focus of public-opinion monitoring. To extract opinion-related information such as the title, content, and author from forum (BBS) pages, the paper proposes a webpage content extraction method that combines the VIPS algorithm with intelligent fuzzy dictionary matching. VIPS uses the visual cues of a Web page (background color, font color and size, borders, logical blocks, and the spacing between logical blocks) together with the DOM tree to segment the page into semantic blocks. The intelligent fuzzy dictionary uses the AC-BM matching algorithm to match the semantic blocks produced by VIPS against the tags stored in a database and extracts the correctly matched fields. Combined, the two can extract a post's title, content, author, posting time, and similar information. The method proceeds as follows: VIPS first extracts the page blocks, separator detection then places the separators, and the semantic blocks are reconstructed; after checking, the segmented page is saved as an XML file; finally, the semantic blocks in the XML file are matched against the dictionary and the successfully matched content is extracted. The paper demonstrates the effectiveness of the method through experiments.
Source: Netinfo Security (《信息网络安全》), 2014, No. 10, pp. 49-53 (5 pages)
Funding: Specialized Research Fund for the Doctoral Program of Higher Education, Ministry of Education [20110181120009]
Keywords: information extraction; VIPS algorithm; intelligent dictionary; AC-BM algorithm
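
The dictionary-matching step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation, and several parts are assumptions: the XML layout (a `page` element containing `block` elements), the `LABELS` dictionary, and the field names are all hypothetical; VIPS segmentation itself is not shown; and plain Aho-Corasick multi-pattern matching is used in place of AC-BM, which adds Boyer-Moore style skipping on top of the same idea. "Fuzzy" matching is approximated here only by listing several label variants per field and comparing case-insensitively.

```python
from collections import deque
import xml.etree.ElementTree as ET

# Hypothetical dictionary: label variant (lowercase) -> field name.
LABELS = {
    "title": "title",
    "subject": "title",
    "author": "author",
    "posted by": "author",
    "posted at": "time",
    "content": "content",
}


def build_automaton(patterns):
    """Build Aho-Corasick goto/fail/output tables for the given patterns."""
    goto = [{}]   # goto[state][char] -> next state
    fail = [0]    # failure link of each state
    out = [[]]    # patterns that end at each state
    for pat in patterns:
        state = 0
        for ch in pat:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].append(pat)
    queue = deque(goto[0].values())  # depth-1 states fail to the root
    while queue:
        s = queue.popleft()
        for ch, t in goto[s].items():
            queue.append(t)
            f = fail[s]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[t] = goto[f].get(ch, 0)
            out[t] = out[t] + out[fail[t]]
    return goto, fail, out


def find_all(text, automaton):
    """Return (start_index, pattern) for every pattern occurrence in text."""
    goto, fail, out = automaton
    state = 0
    hits = []
    for i, ch in enumerate(text):
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for pat in out[state]:
            hits.append((i - len(pat) + 1, pat))
    return hits


def extract_fields(xml_text, labels=LABELS):
    """Match each semantic block against the label dictionary."""
    automaton = build_automaton(list(labels))
    fields = {}
    for block in ET.fromstring(xml_text).iter("block"):
        text = (block.text or "").strip()
        hits = find_all(text.lower(), automaton)
        if not hits:
            continue  # block carries no known label, e.g. navigation
        # Prefer the earliest hit, and the longest label at that position.
        start, pat = min(hits, key=lambda h: (h[0], -len(h[1])))
        value = text[start + len(pat):].lstrip(" :\t").strip()
        fields.setdefault(labels[pat], value)  # keep the first match per field
    return fields


# Hypothetical output of the segmentation step, saved as XML.
SAMPLE = """<page>
  <block>Forum navigation: Home | Boards</block>
  <block>Subject: Server outage on Monday</block>
  <block>Posted by: alice</block>
  <block>Posted at: 2014-10-01 09:30</block>
  <block>Content: The site was down for two hours.</block>
</page>"""

if __name__ == "__main__":
    print(extract_fields(SAMPLE))
```

Building one automaton for the whole dictionary means each block is scanned once regardless of how many label variants exist, which is the property that makes multi-pattern matchers such as Aho-Corasick and AC-BM attractive for this step.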