Abstract
To address the poor mining quality and high memory usage of existing methods for deep mining of associations in distributed data, this paper proposes a random-forest-based method. The Pearson linear correlation coefficient and the Spearman rank correlation coefficient are adopted as the deep mining parameters for data association, and the association criterion is set according to the values of these parameters. Initial data are collected and processed in the distributed system, and the random forest algorithm is used to classify the distributed data by type. Taking the association between data dimensions into account, the values of the Pearson and Spearman coefficients are then calculated to obtain the final deep mining result. A case analysis shows that the deviation of the association mining results stays below 0.1, which meets the quality requirement for data association mining, and that the memory footprint of the program running the method is reduced.
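The two mining parameters named in the abstract can be illustrated with a minimal sketch. It uses the textbook definitions of the Pearson and Spearman coefficients in plain Python. The function names `pearson`, `average_ranks`, and `spearman` are illustrative assumptions. The paper's own association criterion, its random forest classification step, and its distributed processing are not described in the abstract and are not reproduced here.

```python
import math


def pearson(x, y):
    """Pearson linear correlation coefficient of two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("inputs must have the same length (at least 2)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))
```

For two data dimensions related monotonically but not linearly, such as `x = [1, 2, 3, 4, 5]` and `y = [1, 4, 9, 16, 25]`, `spearman(x, y)` reaches its maximum of 1 while `pearson(x, y)` stays slightly below it. This difference is one reason to consider both coefficients together.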
Author
LYU Li-xin (吕立新), School of Information and Artificial Intelligence, Anhui Business College, Wuhu 241000, China
Source
Journal of Inner Mongolia Minzu University: Natural Sciences (《内蒙古民族大学学报(自然科学版)》)
2023, No. 4, pp. 308-314 (7 pages)
Funding
Outstanding Young Talents Support Program for Universities of Anhui Province (gxyq2018236).