
Design and Implementation of JP Algorithm Based on MapReduce (cited by: 6)
Abstract: Large-scale text clustering must cope with data that is massive, high-dimensional, and sparse. To address these difficulties, this paper proposes a cloud-computing-based solution for clustering massive text collections. The classical Jarvis-Patrick (JP) clustering algorithm is taken as a case study and parallelized with the MapReduce programming model, and the parallel version is validated on the Hadoop platform using the corpus provided by Sogou Labs. The experimental results indicate that parallelizing the JP algorithm is feasible and that, compared with a single-node environment, the parallel algorithm achieves better time performance when processing large-scale text data.
Authors: Cao Zewen (曹泽文), Zhou Yao (周姚)
Source: Computer Engineering (《计算机工程》), CAS, CSCD, 2012, Issue 24, pp. 14-16, 20 (4 pages)
Keywords: text mining; clustering analysis; text clustering; massive data; cloud computing; parallel data mining
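The abstract does not describe how the JP algorithm is split into MapReduce jobs, so the following is only an illustrative sketch of one plausible decomposition, not the authors' design. It assumes documents are sparse term-weight dictionaries compared by cosine similarity, and it uses the common JP variant in which two documents are linked when each appears in the other's k-nearest-neighbour list and the two lists share at least `kt` members. The helper `run_job`, the function `jp_cluster`, and `SAMPLE_DOCS` are hypothetical names; `run_job` is an in-memory stand-in for a Hadoop job, and the final connected-components step runs on the driver rather than as a distributed job.

```python
import math
from collections import defaultdict
from itertools import combinations


def run_job(records, mapper, reducer):
    """In-memory stand-in for one MapReduce job: map, group by key, reduce."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    results = []
    for key in sorted(groups):
        results.extend(reducer(key, groups[key]))
    return results


def cosine(a, b):
    """Cosine similarity of two sparse term-weight dictionaries."""
    dot = sum(w * b[t] for t, w in a.items() if t in b)
    norm = math.sqrt(sum(w * w for w in a.values())) * math.sqrt(
        sum(w * w for w in b.values()))
    return dot / norm if norm else 0.0


def jp_cluster(docs, k, kt):
    """Jarvis-Patrick clustering expressed as two map/reduce passes (a sketch)."""
    # Job 1: score every document pair, then keep the top-k neighbours per doc.
    def sim_mapper(pair):
        (i, a), (j, b) = pair
        s = cosine(a, b)
        yield i, (s, j)
        yield j, (s, i)

    def knn_reducer(doc_id, scored):
        top = sorted(scored, key=lambda sj: (-sj[0], sj[1]))[:k]
        yield doc_id, frozenset(j for _, j in top)

    ids = sorted(docs)
    pairs = [((i, docs[i]), (j, docs[j])) for i, j in combinations(ids, 2)]
    knn = run_job(pairs, sim_mapper, knn_reducer)

    # Job 2: a pair key receives two lists only if the listing is mutual;
    # link the pair when the two lists share at least kt neighbours.
    def pair_mapper(entry):
        doc_id, neighbours = entry
        for other in neighbours:
            yield (min(doc_id, other), max(doc_id, other)), neighbours

    def link_reducer(pair, lists):
        if len(lists) == 2 and len(lists[0] & lists[1]) >= kt:
            yield pair

    links = run_job(knn, pair_mapper, link_reducer)

    # Driver-side simplification: merge linked documents with union-find.
    parent = {d: d for d in docs}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in links:
        parent[find(a)] = find(b)
    clusters = defaultdict(set)
    for d in docs:
        clusters[find(d)].add(d)
    return sorted(sorted(c) for c in clusters.values())


# Hypothetical toy corpus: three Hadoop-related and three soccer-related docs.
SAMPLE_DOCS = {
    "a1": {"hadoop": 1, "mapreduce": 1},
    "a2": {"hadoop": 1, "mapreduce": 1, "cluster": 1},
    "a3": {"hadoop": 1, "cluster": 1},
    "b1": {"soccer": 1, "goal": 1},
    "b2": {"soccer": 1, "goal": 1, "league": 1},
    "b3": {"soccer": 1, "league": 1},
}
```

Calling `jp_cluster(SAMPLE_DOCS, k=2, kt=1)` separates the two topic groups, while raising `kt` to 2 leaves every document in its own cluster, since with `k=2` no mutual pair can share two further neighbours. The all-pairs similarity step is quadratic in the number of documents; a real deployment on a large corpus would presumably need some form of partitioning or candidate pruning, which this sketch does not attempt.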


