Abstract
Traditional k-nearest-neighbor (KNN) classification is stable and accurate, but its computational efficiency degrades sharply when the sample set is very large. To address this, the paper proposes a parallel text classification algorithm, MapReduce for KNN (MKNN), based on concatenating texts around cluster centers. First, text clustering is used to concatenate and merge highly similar documents, and the merged documents replace the original individual documents in the KNN query, which reduces the number of text-similarity computations. Second, a MapReduce-based parallel KNN procedure is built for the text concatenation and KNN query steps, further improving computational efficiency. Finally, experiments comparing the method with comparable single-threaded algorithms on classification accuracy and computational efficiency show that the proposed algorithm effectively increases classification speed while maintaining sufficient accuracy.
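The pipeline described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the features, the clustering method, or the MapReduce job layout, so the sketch assumes bag-of-words term counts, cosine similarity, a greedy same-label merge with a hypothetical `threshold` parameter, and map and reduce steps emulated in a single process over list partitions. All function names are illustrative.

```python
import heapq
import math
from collections import Counter


def vectorize(text):
    """Bag-of-words term counts (an assumed stand-in for the paper's features)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def merge_similar(docs, threshold=0.5):
    """Greedily concatenate same-label documents that are similar enough.

    Concatenating two texts is equivalent to summing their term counts.
    """
    merged = []  # list of (vector, label)
    for text, label in docs:
        vec = vectorize(text)
        for i, (mvec, mlabel) in enumerate(merged):
            if mlabel == label and cosine(vec, mvec) >= threshold:
                merged[i] = (mvec + vec, mlabel)
                break
        else:
            merged.append((vec, label))
    return merged


def map_partition(partition, query_vec, k):
    """Map step: local top-k (similarity, label) pairs for one partition."""
    scored = ((cosine(query_vec, vec), label) for vec, label in partition)
    return heapq.nlargest(k, scored)


def reduce_topk(local_results, k):
    """Reduce step: merge local top-k lists, keep the global top-k, and vote."""
    best = heapq.nlargest(k, (pair for result in local_results for pair in result))
    votes = Counter(label for _, label in best)
    return votes.most_common(1)[0][0]


def classify(query, merged, k=3, n_partitions=2):
    """Classify a query against merged documents split into partitions."""
    query_vec = vectorize(query)
    partitions = [merged[i::n_partitions] for i in range(n_partitions)]
    # In a real MapReduce job each partition would be handled by a separate mapper.
    local = [map_partition(p, query_vec, k) for p in partitions]
    return reduce_topk(local, k)


train = [
    ("stock market shares rise", "finance"),
    ("stock market shares fall", "finance"),
    ("bank interest rates loan", "finance"),
    ("football match goal score", "sports"),
    ("football match goal win", "sports"),
    ("tennis player wins tournament", "sports"),
]
merged = merge_similar(train, threshold=0.5)
print(len(train), "documents merged into", len(merged))
print(classify("football goal", merged, k=1))
```

Because each mapper returns only its local top-k candidates, the reducer handles at most k pairs per partition, and merging shrinks the number of similarity computations per query from the number of original documents to the number of merged documents.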
Authors
Dong Bo (董博)
Wang Xue (王雪)
DONG Bo; WANG Xue (School of Innovation and Entrepreneurship; Information Technology Center, Liaoning University, Shenyang 110036, China)
Source
Control Engineering of China (《控制工程》)
CSCD
Peking University Core Journals (北大核心)
2018, Issue 6, pp. 1012-1018 (7 pages)
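Funding
Science and Technology Project of the Education Department of Liaoning Province (LYB201620)
Science and Technology Project of the National Archives Administration of China (2016-X-25)
Science and Technology Project of the Liaoning Provincial Archives Bureau (L-2016-R-6, L-2016-R-8, L-2017-X-7)
2017 Liaoning University "Undergraduate Innovation and Entrepreneurship Training Program" (x201710140136, x201710140333)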