Abstract
To meet the sample-diversity and real-time requirements of big data mining, a big data machine learning system built on a distributed computing framework is proposed. The iterative computation of current algorithms is analyzed, and the magnitude of change in the model vector is used to divide the iteration process into a coarse-tuning stage and a fine-tuning stage. The analysis also finds that, in some stages, most samples have little influence on the computed result, so the result of the previous iteration can be reused directly during iteration, which reduces the amount of computation and improves computational efficiency. Experimental results show that, in a distributed cluster environment, the algorithm reduces the computation required for model training, improves the accuracy of the trained model, and improves the real-time performance of big data mining.
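The reuse idea described in the abstract can be illustrated with a small single-machine sketch. It is not the authors' implementation: the least-squares model, the function name `train_with_reuse`, the thresholds `phase_tol` and `sample_tol`, and the periodic full refresh `refresh_every` are all assumptions made for illustration, and the distributed framework is left out. The size of the model-vector update decides whether an iteration counts as coarse-tuning or fine-tuning; during fine-tuning, samples whose cached gradient is negligible keep their result from the previous iteration instead of being recomputed.

```python
import numpy as np

def train_with_reuse(X, y, lr=0.1, iters=300, phase_tol=1e-2,
                     sample_tol=1e-3, refresh_every=10):
    """Least-squares gradient descent that reuses cached per-sample
    gradients during the fine-tuning phase (illustrative assumptions only)."""
    n, d = X.shape
    w = np.zeros(d)
    cached = np.zeros((n, d))          # last computed gradient of each sample
    active = np.ones(n, dtype=bool)    # samples to recompute this iteration
    recomputed = 0                     # per-sample gradient evaluations performed
    for it in range(iters):
        # Recompute gradients only for active samples; the rest keep the
        # value cached in an earlier iteration.
        resid = X[active] @ w - y[active]
        cached[active] = resid[:, None] * X[active]
        recomputed += int(active.sum())
        step = lr * cached.mean(axis=0)
        w -= step
        coarse = np.linalg.norm(step) > phase_tol
        if coarse or (it + 1) % refresh_every == 0:
            # Coarse-tuning (or scheduled refresh): recompute every sample.
            active = np.ones(n, dtype=bool)
        else:
            # Fine-tuning: skip samples whose cached gradient is negligible.
            active = np.linalg.norm(cached, axis=1) > sample_tol
    return w, recomputed

# Synthetic, noise-free data used only to exercise the sketch.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true
iters = 300
w, recomputed = train_with_reuse(X, y, iters=iters)
full_cost = X.shape[0] * iters         # evaluations without any reuse
```

Comparing `recomputed` with `full_cost` gives a rough sense of how much per-sample work the reuse rule avoids on this toy problem. The periodic refresh is a safeguard added here so that stale gradients cannot accumulate indefinitely; the abstract does not say how the original system handles this.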
Authors
潘世成
郑国标
赵耀
PAN Shi-cheng; ZHENG Guo-biao; ZHAO Yao (Qingyuan Power Supply Bureau, Guangdong Qingyuan Power Supply Bureau, Qingyuan 511515, China; Guangzhou Zhongsoft Information Technology Co., Ltd., Guangzhou 510665, China)
Source
Electronic Design Engineering (《电子设计工程》)
2020, No. 11, pp. 79-83 (5 pages)
Keywords
distributed computing framework
big data
machine learning
iterative computation