Abstract
As data volumes multiply in the information age, traditional text similarity detection methods can no longer handle large-scale text data. To address this, the paper proposes a text similarity detection model based on Hadoop cluster technology. The model has three steps. First, an experimental platform is built with Hadoop, and its hardware and software are optimized. Second, each document is converted into a set using an improved Shingling algorithm based on the MapReduce programming model. Third, a distributed New Minhash algorithm is proposed to compute the signature matrix, the Jaccard coefficient is then used to calculate similarity, and the similar documents are selected. Experiments show that, for the same operations, the optimized platform reduces running time by nearly 5.65%. The model not only computes text similarity more accurately but also adapts better to processing large-scale text data on a distributed platform, and it scales well.
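The second and third steps of the abstract can be illustrated with a small single-machine sketch. This is not the paper's improved Shingling or its New Minhash algorithm, whose details the abstract does not give; it is the textbook version of character shingling, MinHash signatures, and Jaccard similarity. The function names, the shingle length `k=5`, the signature length `num_hashes=128`, and the choice of BLAKE2b as the seeded hash are all assumptions made for illustration.

```python
import hashlib


def shingles(text, k=5):
    """Return the set of character k-shingles of a document (k is an assumed default)."""
    text = " ".join(text.split())  # collapse runs of whitespace
    if len(text) < k:
        return {text} if text else set()
    return {text[i:i + k] for i in range(len(text) - k + 1)}


def jaccard(a, b):
    """Exact Jaccard coefficient |A ∩ B| / |A ∪ B| of two sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def _seeded_hash(shingle, seed):
    """Deterministic 64-bit hash of a shingle under a given seed (illustrative choice)."""
    data = f"{seed}:{shingle}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_signature(shingle_set, num_hashes=128):
    """One column of the signature matrix: the minimum hash value per hash function."""
    if not shingle_set:
        raise ValueError("cannot build a signature for an empty shingle set")
    return [
        min(_seeded_hash(s, seed) for s in shingle_set)
        for seed in range(num_hashes)
    ]


def estimate_similarity(sig_a, sig_b):
    """Estimate Jaccard similarity as the fraction of matching signature positions."""
    if len(sig_a) != len(sig_b):
        raise ValueError("signatures must have the same length")
    matches = sum(x == y for x, y in zip(sig_a, sig_b))
    return matches / len(sig_a)


if __name__ == "__main__":
    doc_a = "traditional text similarity detection cannot handle large-scale data"
    doc_b = "traditional text similarity methods cannot handle large-scale text data"
    set_a, set_b = shingles(doc_a), shingles(doc_b)
    print("exact Jaccard:   ", jaccard(set_a, set_b))
    print("MinHash estimate:", estimate_similarity(
        minhash_signature(set_a), minhash_signature(set_b)))
```

In a MapReduce setting, one plausible arrangement is for the map phase to emit the shingle set or signature of each document and for the reduce phase to compare signatures; the abstract does not say how the paper actually divides the work.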
Source
《新疆大学学报(自然科学版)》
CAS
PKU Core (北大核心, Peking University Core Journals)
2017, Issue 3, pp. 308-315 (8 pages)
Journal of Xinjiang University (Natural Science Edition)
Funding
National Natural Science Foundation of China (61462011)