Abstract
To quickly locate the information a user is interested in among a vast number of web pages, a Hadoop-based web page text clustering algorithm is proposed. Web page text is stored as key/value pairs in the Hadoop Distributed File System (HDFS). A statistics-based method is used for word segmentation, followed by noise removal, feature extraction, and construction of a vector space model. An improved k-means clustering algorithm is then implemented on MapReduce. Experiments that run the distributed computation on data sets of different sizes show that the larger the data set, the better the clustering result.
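The MapReduce formulation of k-means mentioned in the abstract can be illustrated with a minimal sketch. It shows one iteration of standard k-means split into a map step (assign each document vector to its nearest centroid) and a reduce step (average the vectors assigned to each centroid). The abstract does not say what the paper's improvement consists of, so this is plain k-means, not the authors' variant. It runs in-process rather than on Hadoop, and the function names, the Euclidean distance measure, and the toy vectors are all assumptions made for illustration.

```python
from collections import defaultdict
from math import dist  # Euclidean distance, Python 3.8+


def map_phase(records, centroids):
    """Map: for each (doc_id, vector) record, emit (nearest centroid index, vector)."""
    for _doc_id, vector in records:
        nearest = min(range(len(centroids)), key=lambda i: dist(vector, centroids[i]))
        yield nearest, vector


def reduce_phase(pairs, centroids):
    """Reduce: average the vectors grouped under each centroid index.

    A centroid with no assigned vectors keeps its previous position
    (an assumption of this sketch, not something stated in the source).
    """
    groups = defaultdict(list)
    for index, vector in pairs:
        groups[index].append(vector)

    new_centroids = []
    for i, old in enumerate(centroids):
        members = groups.get(i)
        if not members:
            new_centroids.append(old)
            continue
        count = len(members)
        new_centroids.append(tuple(sum(column) / count for column in zip(*members)))
    return new_centroids


def kmeans_iteration(records, centroids):
    """Run one map + reduce pass and return the updated centroids."""
    return reduce_phase(map_phase(records, centroids), centroids)


# Hypothetical toy input: (doc_id, feature vector) pairs, mirroring key/value storage.
records = [("a", (0.0, 0.0)), ("b", (0.0, 2.0)), ("c", (10.0, 10.0))]
centroids = [(0.0, 1.0), (9.0, 9.0)]
updated = kmeans_iteration(records, centroids)
```

In a real Hadoop job the map and reduce steps would run as separate distributed tasks over HDFS blocks, and the driver would repeat the iteration until the centroids stop moving.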
Authors
Yin Tieyuan (尹铁源)
Zhang Ruiqin (张瑞琴)
Yin Tieyuan; Zhang Ruiqin (School of Information Science and Engineering, Shenyang University of Technology, Shenyang 110000, China)
Source
《信息通信》
2018, No. 4, pp. 32-34 (3 pages)
Information & Communications