Journal Article

A Text Classification Method Based on a Normalized Term Frequency Bayesian Model (cited by: 10)

Normalized term frequency Bayes for text classification
Abstract: To reduce the impact of term-frequency information and document length on classification results for massive text collections, a Bayesian classification model with normalized term frequencies is proposed. Built on the MapReduce distributed computing framework, the method normalizes document length by taking the logarithmic average of each document's high-frequency term features, remedying the naive Bayes model's weak parameter estimation in text classification. Experimental results show that the method outperforms naive Bayes in classification accuracy, exhibits good scalability and extensibility, and can be applied to fast classification of large-scale text data.
Authors: 张杰, 陈怀新
Source: Computer Engineering and Design (《计算机工程与设计》), Peking University Core Journal, 2016, No. 3, pp. 799-802 (4 pages)
Keywords: text classification; naive Bayes; parameter estimation; term-frequency features; parallel computing
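The abstract's core idea, damping high term frequencies with a logarithm and normalizing by document length before estimating naive Bayes parameters, can be sketched as follows. This is a minimal single-machine illustration under stated assumptions, not the paper's MapReduce implementation: the names (`normalized_tf`, `NormalizedTFNaiveBayes`), the `log(1 + tf)` damping, and the Laplace smoothing scheme are all assumptions, since the paper's exact formulas are not reproduced in this record.

```python
import math
from collections import Counter, defaultdict

def normalized_tf(tokens):
    """Dampen raw counts with log(1 + tf), then divide by the document's
    total damped weight so long and short documents contribute comparably.
    One plausible reading of the paper's length normalization, not its
    exact formula."""
    counts = Counter(tokens)
    damped = {t: math.log(1 + c) for t, c in counts.items()}
    total = sum(damped.values())
    return {t: w / total for t, w in damped.items()} if total else {}

class NormalizedTFNaiveBayes:
    """Multinomial naive Bayes estimated from normalized term weights
    instead of raw counts; class and method names are illustrative."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha                      # Laplace smoothing constant
        self.class_prior = {}                   # class -> log P(c)
        self.term_weight = defaultdict(dict)    # class -> {term: summed weight}
        self.class_total = defaultdict(float)   # class -> total summed weight
        self.vocab = set()

    def fit(self, docs, labels):
        counts = Counter(labels)
        n = len(labels)
        self.class_prior = {c: math.log(k / n) for c, k in counts.items()}
        for tokens, c in zip(docs, labels):
            # Accumulate normalized (not raw) weights per class.
            for t, w in normalized_tf(tokens).items():
                self.term_weight[c][t] = self.term_weight[c].get(t, 0.0) + w
                self.class_total[c] += w
                self.vocab.add(t)

    def predict(self, tokens):
        v = len(self.vocab)
        best, best_score = None, float("-inf")
        for c, prior in self.class_prior.items():
            denom = self.class_total[c] + self.alpha * v
            score = prior
            for t, w in normalized_tf(tokens).items():
                num = self.term_weight[c].get(t, 0.0) + self.alpha
                score += w * math.log(num / denom)
            if score > best_score:
                best, best_score = c, score
        return best
```

A classifier of this shape parallelizes naturally: the per-document `normalized_tf` contributions in `fit` are independent sums, which is what makes a MapReduce formulation straightforward (mappers emit per-class term weights, reducers sum them).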
Related Literature

References (15)

  • 1 何清, 李宁, 罗文娟, 史忠植. A survey of machine learning algorithms for big data [J]. Pattern Recognition and Artificial Intelligence, 2014, 27(4): 327-336. (Cited by: 330)
  • 2 Upadhyaya SR. Parallel approaches to machine learning: a comprehensive survey [J]. Journal of Parallel and Distributed Computing, 2013, 73(3): 284-292. (Cited by: 1)
  • 3 Wu W, Li H, Wang H, et al. Probase: A probabilistic taxonomy for text understanding [C] // ACM SIGMOD International Conference on Management of Data, 2012: 481-492. (Cited by: 1)
  • 4 Lo S, Ding L. Probabilistic reasoning on background net: An application to text categorization [C] // International Conference on Machine Learning and Cybernetics, 2012: 688-694. (Cited by: 1)
  • 5 巩知乐, 张德贤, 胡明明. An improved support vector machine algorithm for text classification [J]. Computer Simulation, 2009, 26(7): 164-167. (Cited by: 37)
  • 6 Zeng Y, Yang Y, Zhao L. Pseudo nearest neighbor rule for pattern classification [J]. Expert Systems with Applications, 2009, 36(2): 3587-3595. (Cited by: 1)
  • 7 赵喆, 向阳, 王继生. Text classification technology based on parallel computing [J]. Journal of Computer Applications, 2013, 33(A02): 60-62. (Cited by: 4)
  • 8 White T. Hadoop: The definitive guide [M]. O'Reilly Media Inc, 2009. (Cited by: 1)
  • 9 Fereira CR, Junior TC, Traina AJM, et al. Clustering very large multi-dimensional datasets with MapReduce [C] // 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2011: 690-698. (Cited by: 1)
  • 10 Kim BJ. A classifier for big data [C] // 6th International Conference on Convergence and Hybrid Information Technology, 2012: 505-512. (Cited by: 1)

Secondary References (95)

  • 1 搜狐研发中心 (Sohu R&D Center). Sogou text classification corpus [EB/OL]. 2008. http://www.sogou.com/labs/dl/c.html. (Cited by: 3)
  • 2 DEAN J, GHEMAWAT S. MapReduce: simplified data processing on large clusters [C] // Proceedings of the 6th Symposium on Operating Systems Design and Implementation. San Francisco, CA, USA: USENIX Association, 2004: 137-149. (Cited by: 1)
  • 3 YANG Y, PEDERSEN J O. A comparative study on feature selection in text categorization [C] // Proceedings of the Fourteenth International Conference on Machine Learning. San Francisco: Morgan Kaufmann, 1997: 412-420. (Cited by: 1)
  • 4 FORMAN G. An extensive empirical study of feature selection metrics for text classification [J]. Journal of Machine Learning Research, 2003, 3(1): 1289-1305. (Cited by: 1)
  • 5 CORTES C, VAPNIK V. Support-vector networks [J]. Machine Learning, 1995, 20(3): 273-297. (Cited by: 1)
  • 6 VAPNIK V. The nature of the statistical learning theory [M]. New York: Springer, 1999. (Cited by: 1)
  • 7 黄陳. Research on kernel functions of support vector machines [D]. Suzhou: Soochow University, 2008. (Cited by: 1)
  • 8 OSUNA E, FREUND R, GIROSI F. Training support vector machines: an application to face detection [C] // Proceedings of the 1997 IEEE Computer Society Conference on Computer Vision and Pattern Recognition. Washington, DC: IEEE Computer Society, 1997: 130-136. (Cited by: 1)
  • 9 SCHOLKOPF B, BURGES C, SMOLA A J. Advances in kernel methods: support vector learning [M]. Cambridge: MIT Press, 1999: 185-208. (Cited by: 1)
  • 10 LI H G, WU G Q. K-means clustering with bagging and MapReduce [C] // Proceedings of the 2011 44th Hawaii International Conference on System Sciences. Washington, DC: IEEE Computer Society, 2011: 1-8. (Cited by: 1)

Co-citation Literature (368)

Co-cited Literature (115)

Citing Literature (10)

Secondary Citing Literature (36)
