Abstract
Most existing imputation methods for incomplete data are limited to a single type of missing variable and perform relatively poorly on large-scale data. To address missing values in mixed-type variables in real-world big data, this paper proposes a new model, SXGBI (Spark-based eXtreme Gradient Boosting Imputation). It is suited to incomplete data in which continuous and categorical missing variables coexist, and it generalizes to fast processing of large datasets. The method improves the ensemble learning method XGBoost, combines several imputation algorithms into a single ensemble learner, and is parallelized on the Spark distributed computing framework so that it runs well on a Spark cluster. Experiments show that, as the missing rate increases, SXGBI achieves better imputation results on RMSE, PFC and F1 than the other imputation methods in the comparison. It can also be applied effectively to large-scale datasets.
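The abstract evaluates imputation quality with RMSE for continuous variables and PFC (proportion of falsely classified entries) for categorical variables. The following is a minimal sketch of those two metrics under their usual definitions, computed only over the entries that were originally missing. The function names and inputs are illustrative assumptions, not code from the paper, and the paper may normalize RMSE differently.

```python
import math


def rmse(true_vals, imputed_vals):
    """Root mean squared error over imputed continuous entries (assumed plain, unnormalized RMSE)."""
    if len(true_vals) != len(imputed_vals) or not true_vals:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_error = sum((t - p) ** 2 for t, p in zip(true_vals, imputed_vals))
    return math.sqrt(squared_error / len(true_vals))


def pfc(true_labels, imputed_labels):
    """Proportion of imputed categorical entries whose label differs from the true label."""
    if len(true_labels) != len(imputed_labels) or not true_labels:
        raise ValueError("inputs must be non-empty and of equal length")
    wrong = sum(t != p for t, p in zip(true_labels, imputed_labels))
    return wrong / len(true_labels)


if __name__ == "__main__":
    # Hypothetical held-out values that were masked and then imputed.
    print(rmse([1.0, 2.0, 3.0], [1.0, 2.0, 5.0]))
    print(pfc(["a", "b", "c", "a"], ["a", "b", "a", "a"]))
```

Lower is better for both metrics; a PFC of 0 means every masked categorical entry was recovered exactly.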
Authors
邹萌萍
彭敦陆
ZOU Meng-ping; PENG Dun-lu (School of Optical-Electrical and Computer Engineering, University of Shanghai for Science and Technology, Shanghai 200093, China)
Source
《小型微型计算机系统》
CSCD
Peking University Core Journals
2021, No. 1, pp. 111-116 (6 pages)
Journal of Chinese Computer Systems
Funding
Supported by the National Natural Science Foundation of China (61772342, 61703278).