
A Web document classification method based on the fusion of multiple kinds of information (cited by: 1)
Abstract: Most current Web document classification methods rely on text classification of the page body alone and make little use of the other information a Web page carries. To improve the classification accuracy of Web documents, this paper proposes a text classification method that fuses several kinds of information in a Web document: body text, description information, keywords, image-related text, titles, and text set in special fonts such as bold. Because these sources contribute unequally to classification, a genetic algorithm is used to assign an appropriate weight to each, and a support vector machine then classifies the documents. Experimental results show that, compared with classification using body text only, the proposed fusion method effectively improves classification accuracy.
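The pipeline summarized in the abstract (per-field weights, a genetic algorithm to choose them, then a classifier) can be illustrated with a minimal sketch. It is not the authors' implementation. The field subset, the toy documents, and the GA settings are all hypothetical, and a nearest-centroid classifier with cosine similarity stands in for the support vector machine so that the example needs only the Python standard library.

```python
import random
from collections import Counter

# Hypothetical subset of the information sources named in the abstract.
FIELDS = ("body", "title", "keywords")

def fuse(doc, weights):
    """Weighted sum of per-field term counts: one sparse vector per document."""
    vec = Counter()
    for field, w in zip(FIELDS, weights):
        for term in doc.get(field, "").lower().split():
            vec[term] += w
    return vec

def cosine(a, b):
    dot = sum(v * b.get(t, 0.0) for t, v in a.items())
    na = sum(v * v for v in a.values()) ** 0.5
    nb = sum(v * v for v in b.values()) ** 0.5
    return dot / (na * nb) if na and nb else 0.0

def accuracy(weights, train, valid):
    """Nearest-centroid accuracy on `valid` (stand-in for the paper's SVM)."""
    centroids = {}
    for doc, label in train:
        centroids.setdefault(label, Counter()).update(fuse(doc, weights))
    hits = 0
    for doc, label in valid:
        vec = fuse(doc, weights)
        pred = max(sorted(centroids), key=lambda c: cosine(vec, centroids[c]))
        hits += pred == label
    return hits / len(valid)

def evolve(train, valid, pop_size=12, generations=15, seed=0):
    """Tiny genetic algorithm over field weights in [0, 1]; fitness = accuracy."""
    rng = random.Random(seed)
    n = len(FIELDS)
    # Seed the population with uniform weights so the result never scores
    # below the unweighted baseline on the validation set.
    pop = [[1.0] * n] + [[rng.random() for _ in range(n)]
                         for _ in range(pop_size - 1)]

    def fitness(w):
        return accuracy(w, train, valid)

    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        parents = pop[: pop_size // 2]      # truncation selection keeps the elite
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, n)       # one-point crossover
            child = a[:cut] + b[cut:]
            i = rng.randrange(n)            # mutate one gene, clamp to [0, 1]
            child[i] = min(1.0, max(0.0, child[i] + rng.gauss(0, 0.2)))
            children.append(child)
        pop = parents + children
    best = max(pop, key=fitness)
    return best, fitness(best)

# Toy data (invented for illustration): bodies are noisy, titles and keywords
# carry the class signal.
train = [
    ({"title": "football match", "body": "click here to read more news",
      "keywords": "sport"}, "sport"),
    ({"title": "stock market", "body": "click here to read more news",
      "keywords": "finance"}, "finance"),
]
valid = [
    ({"title": "football final", "body": "read more market news here",
      "keywords": "sport"}, "sport"),
    ({"title": "market rally", "body": "read more football news here",
      "keywords": "finance"}, "finance"),
]

best_weights, best_score = evolve(train, valid)
print(best_weights, best_score)
```

In a fuller setup the fitness function would more likely be cross-validated accuracy of the actual classifier on a held-out set, which keeps the weights from overfitting a single validation split.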
Authors: Duan Guolun (段国仑)1, Xie Jun (谢钧)1, Guo Leilei (郭蕾蕾)2, Wang Xiaoying (王晓莹)1 (1. Institute of Command Control Engineering, Army Engineering University of PLA, Nanjing 210007, China; 2. Institute of Communications Engineering, Army Engineering University of PLA, Nanjing 210007, China)
Source: Information Technology and Network Security (《信息技术与网络安全》), 2018, No. 6, pp. 76-79 (4 pages)
Keywords: Web document classification; information fusion; genetic algorithm; support vector machine
