Design and Implementation of JP Algorithm Based on MapReduce

Cited by: 6

Abstract: Large-scale text clustering must cope with massive data volumes and with high-dimensional, sparse feature vectors. To address these problems, this paper proposes a cloud-computing-based solution for clustering massive text collections. The classical Jarvis-Patrick (JP) clustering algorithm is chosen as a case study and is parallelized with the MapReduce programming model; the parallel version is validated on the Hadoop platform using the corpus provided by Sogou Lab. Experimental results indicate that parallelizing the JP algorithm under MapReduce is feasible and that, compared with a single-node environment, the parallel algorithm achieves better time performance when processing large-scale text data.
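For readers unfamiliar with Jarvis-Patrick clustering, the following is a minimal single-process sketch of the general technique, arranged as a map step and a reduce step to suggest how it could be split across workers. It is not the authors' Hadoop implementation, which this record does not describe. The function names, the cosine similarity measure, the sparse `{term: weight}` document representation, and the parameters `k` (neighbour-list size) and `kt` (shared-neighbour threshold) are all assumptions made for illustration. The linking rule used here (mutual k-nearest neighbours sharing at least `kt` neighbours) is one common formulation of JP.

```python
from collections import defaultdict
from itertools import combinations
import math


def cosine(a, b):
    """Cosine similarity of two sparse vectors stored as {term: weight} dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def map_pair_similarity(docs):
    """Map step (illustrative): emit (doc_id, (other_id, similarity)) per pair."""
    for i, j in combinations(sorted(docs), 2):
        s = cosine(docs[i], docs[j])
        yield i, (j, s)
        yield j, (i, s)


def reduce_top_k(pairs, k):
    """Reduce step (illustrative): group by doc_id, keep the k nearest neighbours."""
    grouped = defaultdict(list)
    for doc_id, item in pairs:
        grouped[doc_id].append(item)
    return {
        d: [n for n, _ in sorted(items, key=lambda x: (-x[1], x[0]))[:k]]
        for d, items in grouped.items()
    }


def jp_cluster(docs, k, kt):
    """Link two docs if each is in the other's k-NN list and they share at
    least kt neighbours; clusters are the connected components of the links."""
    knn = reduce_top_k(map_pair_similarity(docs), k)
    parent = {d: d for d in docs}

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i, j in combinations(sorted(docs), 2):
        ni, nj = knn.get(i, []), knn.get(j, [])
        mutual = j in ni and i in nj
        shared = len(set(ni) & set(nj))
        if mutual and shared >= kt:
            parent[find(i)] = find(j)

    clusters = defaultdict(set)
    for d in docs:
        clusters[find(d)].add(d)
    return sorted(sorted(c) for c in clusters.values())


# Toy input: two groups of documents with disjoint vocabularies.
docs = {
    "a": {"x": 1.0, "y": 1.0},
    "b": {"x": 1.0, "y": 0.9},
    "c": {"x": 0.9, "y": 1.0},
    "d": {"p": 1.0, "q": 1.0},
    "e": {"p": 1.0, "q": 0.9},
    "f": {"p": 0.9, "q": 1.0},
}
print(jp_cluster(docs, k=2, kt=1))
```

The all-pairs similarity step is quadratic in the number of documents, which is the part a distributed framework would most plausibly need to partition; how the paper actually partitions it is not stated in this record.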

Authors: Cao Zewen (曹泽文), Zhou Yao (周姚)

Institution: College of Information System and Management, National University of Defense Technology

Source: Computer Engineering (《计算机工程》), 2012, No. 24, pp. 14-16, 20 (4 pages in total). Indexed in CAS and CSCD.

Keywords: text mining; clustering analysis; text clustering; massive data; cloud computing; parallel data mining

Classification number: TP391 [Automation and Computer Technology: Computer Application Technology]
