Abstract
Traditional methods suffer from long computation times and uneven task allocation. To address these problems, a distributed big data correlation mining method based on feature weighting is proposed. Soft subspaces are clustered, the weighted cluster centers are expressed according to the uncertainty of the feature weights, and the weights are solved. A technical framework for feature selection is designed to select among the weighted features, and feature screening is completed with a feature space search mechanism. Based on the screening results, the MapReduce programming model is used to scan the cluster centers of the data clusters repeatedly, compute the distance from each sample to the cluster center, and remove the isolated points (outliers). A Shuffle balanced grouping mechanism is used to compute frequent itemsets; FP-tree construction and frequent itemset mining then proceed for each new item until all frequent itemsets have been mined. Experimental results show that the proposed method has a shorter mining time than traditional methods and a more balanced task allocation, which indicates that it has some practical value.
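The outlier-removal step described in the abstract can be illustrated with a minimal sketch. The abstract gives no formula, so everything here is an assumption made for illustration: the distance is taken to be a feature-weighted Euclidean distance, outliers are dropped with a fixed threshold, and the names `weighted_distance`, `remove_outliers`, and `threshold` are hypothetical rather than taken from the paper. The sketch runs on a single machine and does not reproduce the MapReduce scan.

```python
import math

def weighted_distance(sample, center, weights):
    """Feature-weighted Euclidean distance (assumed form, not from the paper)."""
    return math.sqrt(
        sum(w * (x - c) ** 2 for x, c, w in zip(sample, center, weights))
    )

def remove_outliers(samples, center, weights, threshold):
    """Keep samples whose weighted distance to the center is within threshold."""
    return [
        s for s in samples
        if weighted_distance(s, center, weights) <= threshold
    ]

# Toy data: two points near the center and one far away.
samples = [(1.0, 2.0), (1.2, 1.9), (9.0, 9.0)]
center = (1.0, 2.0)
weights = (0.7, 0.3)  # hypothetical feature weights summing to 1

kept = remove_outliers(samples, center, weights, threshold=1.0)
print(kept)
```

In a distributed setting, each mapper could apply the same filter to its own partition of the samples, but how the paper distributes this work is not stated in the abstract.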
Authors
DAI Hui-li (戴惠丽)
WANG Jing-yu (王敬宇)
Minnan Science and Technology Institute, Quanzhou, Fujian 362332, China; Beijing University of Posts and Telecommunications, Beijing 100876, China
Source
Computer Simulation (《计算机仿真》)
Peking University Core Journal (北大核心)
2021, Issue 6, pp. 282-285, 372 (5 pages)
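Funding
Domestic Visiting Scholar Project of the 2018 Fujian Provincial Training Program for Discipline (Specialty) Leaders in Higher Education Institutions (138)
General Teaching Reform Project of Minnan Science and Technology Institute (MKJG-2018-017)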
Keywords
Feature weighting
Distributed big data
Correlation mining
Soft subspace clustering
Task allocation