Abstract
To address the difficulty of predictive analysis on big data, which stems from data complexity, heterogeneity, security, scalability, and sheer data volume, this paper proposes a high-dimensional big data predictive analysis system based on an enhanced scalable random forest (ESRF). The system enhances the scalable random forest (SRF) by performing hyperparameter optimization on the training dataset, and then applies principal component analysis (PCA) and information gain (IG) to the preprocessed data to remove features that do not affect the model, which reduces processing time in the model development phase. Experimental results show that the proposed system delivers excellent predictive ability and effective performance with the least processing time across all experimental datasets.
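The pipeline described in the abstract (feature reduction with IG and PCA, followed by a random forest whose hyperparameters are tuned on the training data) can be sketched roughly as follows. This is a minimal single-machine illustration using scikit-learn, not the authors' ESRF implementation: the synthetic dataset, the choice to chain IG filtering before PCA, the use of mutual information as the information-gain estimate, and every parameter value below are assumptions made for the example.

```python
from functools import partial

from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline

# Synthetic high-dimensional data (hypothetical stand-in for the paper's datasets).
X, y = make_classification(
    n_samples=400, n_features=60, n_informative=8, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Assumed ordering: keep the 20 features with the highest estimated
# information gain, project them onto 10 principal components, then
# fit a random forest on the reduced representation.
pipeline = Pipeline([
    ("ig", SelectKBest(partial(mutual_info_classif, random_state=0), k=20)),
    ("pca", PCA(n_components=10, random_state=0)),
    ("rf", RandomForestClassifier(random_state=0)),
])

# Hyperparameter optimization on the training set only (illustrative grid).
param_grid = {"rf__n_estimators": [50, 100], "rf__max_depth": [5, None]}
search = GridSearchCV(pipeline, param_grid, cv=3)
search.fit(X_train, y_train)

best_params = search.best_params_
test_accuracy = search.score(X_test, y_test)
print(best_params, round(test_accuracy, 3))
```

Placing the reduction steps inside the pipeline means they are refit on each cross-validation fold, so the held-out fold never influences feature selection. A distributed setting of the kind the abstract implies would need a different library; the structure of the steps is what this sketch aims to show.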
Authors
LI Fa-ling (李发陵); PENG Juan (彭娟)
College of Software, Chongqing Institute of Engineering, Chongqing 400056, China
Source
Journal of Southwest China Normal University (Natural Science Edition) (《西南师范大学学报(自然科学版)》)
CAS
2021, No. 1, pp. 1-6 (6 pages)
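Funding
National Natural Science Foundation of China (61572089)
Chongqing Institute of Engineering Research Fund Project (2019xzky02)
Chongqing Education Science Project (2017-GX-038)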
Keywords
high-dimensional big data
enhanced scalable random forest
dimension reduction
predictive analysis
hyperparameter optimization