A Study of Mixed Automatic Categorization of Multi-type Document Adopting LDA Model (采用LDA主题模型的多种类型文献混合自动分类研究)

Cited by: 8
Abstract: This paper explores the feasibility and advantages of the LDA topic model for the mixed classification and organization of multiple types of documents. Taking library holdings of different types (books, journals, and web pages) as experimental objects, the authors model the experimental materials with the LDA topic model and with the VSM model, and use the SVM algorithm to perform mixed automatic text classification. Simulation experiments show that the LDA topic model has certain advantages over the VSM model, with a maximum gap of 19.9% in mixed automatic classification accuracy. Mixed classification works better between books and academic journals, and between web pages and non-academic journals, where classification accuracy exceeds 72%. The experiments demonstrate that the LDA topic model is highly feasible and applicable for the unified organization of multiple types of documents.
Source: Library Tribune (《图书馆论坛》), CSSCI, PKU Core (北大核心), 2015, No. 1, pp. 74-80 (7 pages)
Keywords: LDA model; mixed categorization; multiple types of document; digital library
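
The two pipelines compared in the abstract (VSM features versus LDA topic features, each fed to an SVM) can be sketched roughly as follows. This is a minimal illustration only, not the authors' implementation. The use of scikit-learn, TF-IDF weighting for the VSM baseline, `LinearSVC` as the SVM, the topic count, and the toy corpus and labels are all assumptions that do not come from the paper.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Hypothetical toy corpus standing in for mixed document types
# (books, journals, web pages). Texts and labels are invented.
docs = [
    "library cataloguing classification of books and subject headings",
    "digital library metadata and classification schemes for collections",
    "book classification rules and catalogue records in libraries",
    "support vector machine kernel methods for machine learning",
    "training a classifier with machine learning and feature vectors",
    "neural network learning algorithm and model training",
]
labels = ["library", "library", "library",
          "computing", "computing", "computing"]


def build_vsm_svm():
    """VSM baseline (assumed TF-IDF weighting): term vectors -> linear SVM."""
    return make_pipeline(TfidfVectorizer(), LinearSVC())


def build_lda_svm(n_topics=2):
    """LDA variant: term counts -> document-topic proportions -> linear SVM.

    n_topics is an illustrative value, not one reported in the paper.
    """
    return make_pipeline(
        CountVectorizer(),
        LatentDirichletAllocation(n_components=n_topics, random_state=0),
        LinearSVC(),
    )


vsm_model = build_vsm_svm().fit(docs, labels)
lda_model = build_lda_svm().fit(docs, labels)

new_docs = [
    "classification of library books",
    "machine learning classifier training",
]
vsm_pred = vsm_model.predict(new_docs)
lda_pred = lda_model.predict(new_docs)

# Document-topic proportions that the SVM receives in the LDA variant:
# run every pipeline step except the final classifier.
doc_topics = lda_model[:-1].transform(new_docs)

print("VSM + SVM:", list(vsm_pred))
print("LDA + SVM:", list(lda_pred))
print("topic proportions shape:", doc_topics.shape)
```

The main design difference is dimensionality: the VSM pipeline hands the SVM one feature per vocabulary term, while the LDA pipeline hands it one feature per topic. A corpus this small says nothing about relative accuracy; the figures in the abstract come from the authors' own experiments.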