Abstract
Most current Web document classification methods rely on text classification of the page body alone and make little use of the other information a Web page contains. To improve classification accuracy, this paper proposes a Web document classification method that fuses several kinds of information in a Web document: body text, description information, keywords, image-related text, titles, and text set in bold or other special fonts. Because these sources contribute unequally to classification, a genetic algorithm is used to assign an appropriate weight to each of them, and a support vector machine then classifies the Web documents. Experimental results show that, compared with classification using the body text alone, the proposed fusion method effectively improves classification accuracy.
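The weighting idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the field list, the function names, the genetic-algorithm settings and the toy representation (whitespace-tokenised term counts) are all assumptions made for illustration, and a nearest-centroid classifier stands in for the support vector machine so that the example needs only the Python standard library.

```python
import random
from collections import Counter

# Hypothetical subset of the page fields named in the abstract.
FIELDS = ["body", "title", "keywords"]

def fuse(doc, weights):
    """Fuse a document into one sparse vector: weighted sum of per-field term counts."""
    vec = Counter()
    for field, w in zip(FIELDS, weights):
        for term in doc.get(field, "").split():
            vec[term] += w
    return vec

def cosine(a, b):
    dot = sum(v * b.get(t, 0.0) for t, v in a.items())
    na = sum(v * v for v in a.values()) ** 0.5
    nb = sum(v * v for v in b.values()) ** 0.5
    return dot / (na * nb) if na and nb else 0.0

def centroids(train, weights):
    """Sum the fused vectors of each class (stand-in for training an SVM)."""
    cents = {}
    for doc, label in train:
        cents.setdefault(label, Counter()).update(fuse(doc, weights))
    return cents

def predict(doc, cents, weights):
    vec = fuse(doc, weights)
    return max(cents, key=lambda label: cosine(vec, cents[label]))

def accuracy(weights, train, valid):
    """Fitness of a weight vector: accuracy on a held-out validation set."""
    cents = centroids(train, weights)
    hits = sum(predict(d, cents, weights) == y for d, y in valid)
    return hits / len(valid)

def evolve_weights(train, valid, pop_size=12, generations=15, seed=0):
    """Tiny genetic algorithm over field weights in [0, 1]."""
    rng = random.Random(seed)
    pop = [[rng.random() for _ in FIELDS] for _ in range(pop_size)]
    # Seed one individual with the body-only baseline, so the result is
    # never worse than that baseline on the validation set.
    pop[0] = [1.0] + [0.0] * (len(FIELDS) - 1)
    for _ in range(generations):
        pop.sort(key=lambda w: accuracy(w, train, valid), reverse=True)
        parents = pop[: pop_size // 2]  # elitist selection: keep the best half
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            child = [rng.choice(pair) for pair in zip(a, b)]  # uniform crossover
            i = rng.randrange(len(child))
            # Gaussian mutation of one gene, clipped to [0, 1].
            child[i] = min(1.0, max(0.0, child[i] + rng.gauss(0, 0.2)))
            children.append(child)
        pop = parents + children
    return max(pop, key=lambda w: accuracy(w, train, valid))
```

Calling `evolve_weights(train, valid)` with lists of `(doc, label)` pairs, where each `doc` is a dict keyed by the names in `FIELDS`, returns one weight per field. In a fuller version, the fused vectors would be passed to an SVM trainer in place of the centroid step.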
Authors
段国仑
谢钧
郭蕾蕾
王晓莹
Duan Guolun1, Xie Jun1, Guo Leilei2, Wang Xiaoying1 (1. Institute of Command Control Engineering, Army Engineering University of PLA, Nanjing 210007, China; 2. Institute of Communications Engineering, Army Engineering University of PLA, Nanjing 210007, China)
Source
《信息技术与网络安全》
2018, Issue 6, pp. 76-79 (4 pages)
Information Technology and Network Security
Keywords
Web document classification
information fusion
genetic algorithm
support vector machine