Abstract
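Using the distributed MapReduce programming model, this paper studies how to parallelize the main computational steps of the naive Bayes text classification algorithm with MapReduce: uniform-format text preprocessing, training, testing, and classification. Experiments were carried out on a Hadoop cloud computing platform. The results show that the MapReduce-parallelized naive Bayes text classifier, deployed and run on the Hadoop platform, achieves good speedup and a recognition rate of 86% on Chinese web page text.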
The major procedures of text classification based on the Naive Bayesian algorithm, namely uniform text format expression, training, testing, and classification, were implemented using the MapReduce programming model. Experiments were conducted in a Hadoop cloud computing environment. The results indicate basically linear speedup as the number of nodes increases. A recall rate of 86% was achieved when classifying Chinese Web pages.
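The training step described above can be illustrated with a minimal single-process sketch in Python. It is not the paper's Hadoop implementation: the function names (map_doc, reduce_counts, train, classify), the key layout, the multinomial model, and the add-one (Laplace) smoothing are all assumptions made for illustration, and the input is assumed to be already segmented into tokens. The map step emits one count per document and per word occurrence, and the reduce step sums counts that share a key, which is the part a MapReduce framework would distribute across nodes.

```python
import math
from collections import defaultdict


def map_doc(label, tokens):
    """Map step (hypothetical): emit one document count and one count per token."""
    yield (("doc", label), 1)
    for word in tokens:
        yield (("word", label, word), 1)


def reduce_counts(pairs):
    """Reduce step (hypothetical): sum the values emitted for each key."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)


def train(corpus):
    """Run map over every (label, tokens) document, then reduce all pairs."""
    pairs = [kv for label, tokens in corpus for kv in map_doc(label, tokens)]
    return reduce_counts(pairs)


def classify(counts, tokens):
    """Pick the label with the highest log posterior, using add-one smoothing."""
    labels = sorted(k[1] for k in counts if k[0] == "doc")
    vocab = {k[2] for k in counts if k[0] == "word"}
    total_docs = sum(counts[("doc", c)] for c in labels)
    best_label, best_score = None, -math.inf
    for c in labels:
        # Total number of word occurrences seen in class c during training.
        words_in_c = sum(
            v for k, v in counts.items() if k[0] == "word" and k[1] == c
        )
        score = math.log(counts[("doc", c)] / total_docs)  # log prior
        for w in tokens:
            seen = counts.get(("word", c, w), 0)
            score += math.log((seen + 1) / (words_in_c + len(vocab)))
        if score > best_score:
            best_label, best_score = c, score
    return best_label


# Toy corpus (invented for illustration, not data from the paper).
corpus = [
    ("sports", ["ball", "goal", "team"]),
    ("sports", ["team", "win"]),
    ("tech", ["code", "cloud"]),
    ("tech", ["cloud", "node", "code"]),
]
counts = train(corpus)
print(classify(counts, ["cloud", "code"]))
```

Because the reduce step only sums integers per key, the counts can be computed independently on separate document splits and merged afterwards, which is what makes this stage a natural fit for MapReduce.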
Source
Journal of Computer Applications (《计算机应用》)
CSCD
Peking University Core Journals (北大核心)
2011, No. 9, pp. 2551-2554, 2566 (5 pages in total)
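Funding
Fundamental Research Funds for the Central Universities (CZY11002)
Wuhan Science and Technology Key Project (201110821229)
National Science and Technology Major Project of the Ministry of Industry and Information Technology (2011ZX03002-001-01)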