Abstract
PageRank is a classic algorithm for Web structure mining and has achieved great success in the Google search engine. However, it requires many iterations, consumes considerable time and space, and is slow in both execution and convergence. After a detailed discussion of the execution flow and internal implementation mechanisms of Hadoop-MapReduce, this paper proposes a parallel, MapReduce-based PageRank algorithm that partitions the matrix into blocks. In essence, it reduces the number of iterations of the Map and Reduce phases in the MapReduce framework, thereby reducing time and space overhead. Finally, an open-source Hadoop-MapReduce platform is set up, Web structure crawling is simulated, and the performance of the traditional algorithm and the improved algorithm is compared. The results show that the improved algorithm needs fewer iterations and achieves higher parallel efficiency, and that its PageRank-based ranking of Web pages demonstrates its advantages in the simulated environment.
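The abstract does not give the details of the block-partitioned algorithm, so the following is only a minimal sketch of the general idea, not the authors' implementation. It simulates the Map and Reduce phases in plain Python instead of running on Hadoop. The function names (`build_blocks`, `map_block`, `reduce_partials`, `pagerank`), the damping factor of 0.85, the square block layout, and the even redistribution of rank from pages without out-links are all assumptions made for illustration. It shows how each matrix block can be multiplied by its slice of the rank vector independently; it does not reproduce the reduction in iterations reported by the paper.

```python
from collections import defaultdict


def build_blocks(links, block_size):
    """Partition the column-stochastic link matrix into square blocks.

    `links` maps page j to the list of pages it links to. Entry (i, j) of
    the matrix is 1/outdegree(j) when j links to i. Blocks are keyed by
    (row_block, col_block) and store only their nonzero entries.
    """
    blocks = defaultdict(list)
    for j, targets in links.items():
        for i in targets:
            key = (i // block_size, j // block_size)
            blocks[key].append((i, j, 1.0 / len(targets)))
    return blocks


def map_block(block_entries, rank):
    """Map step: multiply one matrix block by its slice of the rank vector."""
    partial = defaultdict(float)
    for i, j, weight in block_entries:
        partial[i] += weight * rank[j]
    return partial


def reduce_partials(partials, rank, links, n, damping):
    """Reduce step: sum the partial products per page, then apply damping."""
    summed = defaultdict(float)
    for partial in partials:
        for i, value in partial.items():
            summed[i] += value
    # Assumption: rank held by pages without out-links is spread evenly.
    dangling = sum(rank[j] for j in range(n) if not links.get(j))
    base = (1.0 - damping) / n + damping * dangling / n
    return [base + damping * summed[i] for i in range(n)]


def pagerank(links, n, block_size=2, damping=0.85, tol=1e-10, max_iter=200):
    """Iterate map and reduce until the L1 change in rank drops below tol."""
    blocks = build_blocks(links, block_size)
    rank = [1.0 / n] * n
    for iteration in range(1, max_iter + 1):
        # Each block is independent, so these calls could run in parallel.
        partials = [map_block(entries, rank) for entries in blocks.values()]
        new_rank = reduce_partials(partials, rank, links, n, damping)
        delta = sum(abs(a - b) for a, b in zip(new_rank, rank))
        rank = new_rank
        if delta < tol:
            break
    return rank, iteration


if __name__ == "__main__":
    # Hypothetical four-page graph: page -> pages it links to.
    toy_links = {0: [1, 2], 1: [2], 2: [0], 3: [2]}
    ranks, iterations = pagerank(toy_links, n=4)
    print(iterations, [round(r, 4) for r in ranks])
```

In this sketch the block size only changes how the work is split among map calls; the resulting ranks are the same for any block size, up to floating-point rounding.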
Source
《计算机技术与发展》
2011, No. 8, pp. 6-9, 13 (5 pages in total)
Computer Technology and Development
Funding
Natural Science Foundation of Yunnan Province (2007F174M)
Yunnan University Graduate Research Project Fund (ynny200928)