Abstract
To address the challenges of large-scale text clustering, namely massive data volume, high dimensionality, and sparse feature vectors, this paper proposes a cloud-computing-based solution for clustering massive text collections. The classical Jarvis-Patrick (JP) clustering algorithm is taken as a case study and parallelized with the MapReduce programming model. The parallel version is evaluated on the Hadoop platform using the corpus provided by Sogou Labs. The experimental results indicate that parallelizing the JP algorithm is feasible and that, compared with a single-node environment, the parallel algorithm achieves better time performance when processing large-scale text data.
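As a rough illustration of how JP clustering can be split into map and reduce steps, the sketch below simulates two MapReduce-style jobs in plain Python. It is not the paper's implementation: the helper `run_mapreduce`, the cosine similarity measure, the parameters `k` and `k_min`, and the linking rule (two documents are joined when each is in the other's k-nearest-neighbor list and they share at least `k_min` neighbors) are all assumptions made for this example, and other JP variants count shared neighbors differently.

```python
import math
from collections import defaultdict
from itertools import combinations


def cosine(a, b):
    """Cosine similarity between two sparse term-weight dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def run_mapreduce(records, mapper, reducer):
    """Hypothetical in-memory stand-in for a MapReduce job (map, shuffle, reduce)."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    output = []
    for key in sorted(groups):
        output.extend(reducer(key, groups[key]))
    return output


def jarvis_patrick(docs, k, k_min):
    """Cluster docs (id -> sparse vector) with an assumed JP linking rule."""
    ids = sorted(docs)

    # Job 1: map each document pair to its similarity, reduce to k nearest neighbors.
    def sim_mapper(pair):
        i, j = pair
        s = cosine(docs[i], docs[j])
        yield i, (s, j)
        yield j, (s, i)

    def knn_reducer(doc_id, scored):
        top = sorted(scored, key=lambda x: (-x[0], x[1]))[:k]
        yield doc_id, frozenset(j for _, j in top)

    knn = dict(run_mapreduce(combinations(ids, 2), sim_mapper, knn_reducer))

    # Job 2: emit an edge for each pair that are mutual neighbors with enough overlap.
    def link_mapper(pair):
        i, j = pair
        if j in knn[i] and i in knn[j] and len(knn[i] & knn[j]) >= k_min:
            yield "edges", (i, j)

    def edge_reducer(_key, edges):
        yield from edges

    edges = run_mapreduce(combinations(ids, 2), link_mapper, edge_reducer)

    # Final step: merge linked documents into clusters with union-find.
    parent = {i: i for i in ids}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i, j in edges:
        parent[find(i)] = find(j)

    clusters = defaultdict(set)
    for i in ids:
        clusters[find(i)].add(i)
    return sorted(sorted(c) for c in clusters.values())
```

On a real cluster the all-pairs similarity step dominates the cost, which is why distributing it across nodes is where a MapReduce formulation would be expected to help; this toy version only shows the data flow between the stages.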
Source
Computer Engineering (《计算机工程》)
CAS
CSCD
2012, No. 24, pp. 14-16, 20 (4 pages)
Keywords
text mining
cluster analysis
text clustering
massive data
cloud computing
parallel data mining