Abstract
To address the low efficiency of text classification algorithms running on a single machine over large data sets, this paper proposes a doubly parallel cloud computing framework based on GPU (graphics processing unit) and MapReduce technology. By constructing an adaptive computation process for double parallel computing and combining it with the characteristics of an improved TFIDF (term frequency-inverse document frequency) algorithm, the authors implement an improved TFIDF algorithm on the double parallel adaptive computing model. In the experiments, the running efficiency of the improved TFIDF algorithm is compared across different runtime environments, and its execution efficiency is compared across different numbers of computing nodes. The results show that the improved TFIDF algorithm can process massive data quickly and effectively, and that under double parallel adaptive computing its execution efficiency improves as the number of nodes increases.
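The abstract does not state the improved TFIDF formula, so the following is only a minimal sketch of the textbook TF-IDF weighting, arranged as a map step and a reduce step to illustrate how the computation splits into MapReduce-style phases. It runs in a single Python process and uses neither Hadoop nor a GPU; the function names (`map_term_counts`, `reduce_tfidf`, `tfidf`) and whitespace tokenization are assumptions made for illustration, not the authors' implementation.

```python
import math
from collections import Counter, defaultdict


def map_term_counts(doc_id, text):
    """Map step: emit ((term, doc_id), (count, doc_length)) for one document."""
    tokens = text.lower().split()  # assumption: naive whitespace tokenization
    counts = Counter(tokens)
    return [((term, doc_id), (n, len(tokens))) for term, n in counts.items()]


def reduce_tfidf(mapped, num_docs):
    """Reduce step: group by term, derive document frequency, then TF-IDF."""
    by_term = defaultdict(list)
    for (term, doc_id), (n, doc_len) in mapped:
        by_term[term].append((doc_id, n / doc_len))  # relative term frequency
    scores = {}
    for term, postings in by_term.items():
        # Standard IDF: log(N / df); not the paper's improved variant.
        idf = math.log(num_docs / len(postings))
        for doc_id, tf in postings:
            scores[(term, doc_id)] = tf * idf
    return scores


def tfidf(corpus):
    """Run the map step over every document, then a single reduce step."""
    mapped = []
    for doc_id, text in corpus.items():
        mapped.extend(map_term_counts(doc_id, text))
    return reduce_tfidf(mapped, len(corpus))
```

In a real cluster the map calls would run on separate nodes and the grouping by term would be handled by the framework's shuffle phase; here both are simulated with ordinary Python loops and a dictionary.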
Authors
孙玉强
巢碧霞
SUN Yu-qiang, CHAO Bi-xia (School of Information Science and Engineering, Changzhou University, Changzhou 213164, China)
Source
《计算机工程与设计》
Peking University Core Journal
2016, No. 11, pp. 3016-3021 (6 pages)
Computer Engineering and Design