Abstract
With the arrival of the big-data era, the high computational complexity of the K-nearest neighbor (KNN) algorithm has become an increasingly prominent drawback. Building on an in-depth study of KNN and of the MapReduce programming model, and using its open-source implementation Hadoop, this paper proposes a KNN parallelization scheme based on MapReduce and the distributed cache mechanism. The scheme completes the classification task in the Mapper stage alone, which reduces the communication overhead between the TaskTracker and the JobTracker and also avoids transferring Mapper intermediate results between task nodes in the cluster. Experiments on a Hadoop cluster verify that the proposed parallel KNN scheme achieves good speedup and scalability.
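The map-only idea in the abstract can be illustrated with a minimal sketch. It is plain Python that simulates the data flow, not the paper's Hadoop code: the function name `knn_map`, the in-memory `training_set` (standing in for the file each mapper would load from the distributed cache), the squared Euclidean distance, and the majority vote are all assumptions made for illustration.

```python
import heapq
from collections import Counter


def knn_map(test_split, training_set, k):
    """Simulated map-only task: classify every test record in one input split.

    training_set stands in for the training file that each mapper would read
    from the distributed cache (an assumption); no reducer is involved.
    """
    for record_id, features in test_split:
        # Keep the k training points with the smallest squared Euclidean distance.
        nearest = heapq.nsmallest(
            k,
            training_set,
            key=lambda item: sum((a - b) ** 2 for a, b in zip(features, item[0])),
        )
        # Majority vote over the labels of the k neighbours.
        label = Counter(lbl for _, lbl in nearest).most_common(1)[0][0]
        yield record_id, label


# Hypothetical toy data: (features, label) pairs for training.
training_set = [
    ((0.0, 0.0), "a"),
    ((0.1, 0.2), "a"),
    ((1.0, 1.0), "b"),
    ((0.9, 1.1), "b"),
    ((1.2, 0.8), "b"),
]

# Two input splits of (record_id, features); each would go to its own mapper.
splits = [
    [(0, (0.05, 0.1)), (1, (1.0, 0.9))],
    [(2, (0.2, 0.0))],
]

# Each split is processed independently, so mapper outputs are final results.
results = [pair for split in splits for pair in knn_map(split, training_set, k=3)]
print(results)
```

Because every simulated mapper holds the whole training set, each test record is classified locally and nothing needs to be shuffled to a reduce stage, which is the property the abstract credits for the lower communication overhead.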
Source
《微型机与应用》
2015, No. 2, pp. 18-21 (4 pages)
Microcomputer & Its Applications