Abstract
In the current era of explosive data growth on the Internet, the direction of public opinion on forums has an increasing influence on society, and monitoring and guiding that opinion has become unavoidable. With data volumes this large, monitoring public opinion effectively is a difficult problem. Key information in forum pages, such as titles and content, is the main focus of public opinion monitoring. To extract opinion-related information such as the title, content, and author from forum pages, the paper proposes a web content extraction method that combines the VIPS algorithm with intelligent fuzzy dictionary matching. VIPS uses the visual cues of a web page (background color, font color and size, borders, logical blocks, and the spacing between logical blocks) together with the DOM tree to segment the page into semantic blocks. The intelligent fuzzy dictionary uses the AC-BM matching algorithm to match the semantic blocks produced by VIPS against the labels stored in a database, and extracts the fields that match. Combined, the two can extract a post's title, content, author, posting time, and other information. Concretely, the method first uses VIPS to extract the page blocks, then detects and sets separators, and then reconstructs the semantic blocks. The segmented page is saved as an XML file, the semantic blocks in that file are matched against the dictionary, and the matched content is extracted. Finally, the paper reports experiments that demonstrate the effectiveness of the method.
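The dictionary-matching step described above can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation. It assumes the VIPS stage has already written its semantic blocks to an XML file, and it uses a plain Aho-Corasick automaton in place of AC-BM, because the abstract does not give the details of the AC-BM variant. The XML layout, the `block` tag name, the `LABELS` alias table, and all function names are hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import deque

# Hypothetical label dictionary: alias (lowercase) -> field name.
LABELS = {
    "title": "title",
    "subject": "title",
    "author": "author",
    "posted by": "author",
    "posted on": "time",
    "content": "content",
}


def build_automaton(patterns):
    """Build Aho-Corasick goto, failure, and output tables."""
    goto, fail, out = [{}], [0], [[]]
    for pat in patterns:
        state = 0
        for ch in pat:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].append(pat)
    # Breadth-first pass: depth-1 states keep failure link 0.
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            out[nxt] = out[nxt] + out[fail[nxt]]
    return goto, fail, out


def find_labels(text, automaton):
    """Yield (start_index, alias) for every alias that occurs in text."""
    goto, fail, out = automaton
    state = 0
    for i, ch in enumerate(text):
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for pat in out[state]:
            yield i - len(pat) + 1, pat


def extract_fields(xml_text, labels=LABELS):
    """Map each semantic block to a field via its earliest matching alias."""
    automaton = build_automaton(labels)
    fields = {}
    for block in ET.fromstring(xml_text).iter("block"):
        text = (block.text or "").strip()
        # Assumes lowercasing keeps string length (true for ASCII text).
        hits = sorted(find_labels(text.lower(), automaton))
        if not hits:
            continue  # block carries no known label, e.g. navigation
        start, alias = hits[0]
        value = text[start + len(alias):].lstrip(" :").strip()
        fields.setdefault(labels[alias], value)
    return fields


# Hypothetical output of the segmentation stage.
SAMPLE = """<page>
  <block>Subject: Server outage last night</block>
  <block>Posted by: alice</block>
  <block>Posted on: 2014-10-01 08:30</block>
  <block>Navigation bar</block>
  <block>Content: The site was down for two hours.</block>
</page>"""

if __name__ == "__main__":
    print(extract_fields(SAMPLE))
```

Keeping several aliases per field ("subject" and "title", for example) is one simple way to approximate the fuzzy behaviour of the dictionary. A multi-pattern matcher scans each block once regardless of how many aliases the dictionary holds.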
Source
Netinfo Security (《信息网络安全》)
2014, No. 10, pp. 49-53 (5 pages)
Funding
Specialized Research Fund for the Doctoral Program of Higher Education, Ministry of Education of China [20110181120009]
Keywords
information extraction
VIPS algorithm
intelligent dictionary
AC-BM algorithm