Abstract
Objective Imbalanced samples are common in fields such as medicine and finance, and classifying them accurately is critical. Traditional machine learning algorithms such as decision trees and logistic regression achieve low classification accuracy on the minority class of imbalanced data, so improving classification performance on imbalanced samples is necessary. Methods Taking a stroke dataset as an example, an optimized prediction model for imbalanced data was built at three levels: the data level, the feature level, and the algorithm level. SMOTEENN sampling was used at the data level, random-forest-based recursive feature elimination at the feature level, and the CatBoost and XGBoost ensemble algorithms at the algorithm level. Results Comparing model performance identified the best-performing prediction model: "SMOTEENN sampling + random-forest-based recursive feature elimination (RFRFE) + XGBoost classification". This model can improve the accuracy of stroke prediction, help the public estimate their stroke risk, and provide a reference for doctors' decisions; it can also be extended to risk prediction for other diseases with imbalanced samples.
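The data-level step named in the Methods, SMOTEENN, combines SMOTE oversampling of the minority class with Edited Nearest Neighbours (ENN) cleaning. The following is a minimal standard-library sketch of that idea, not the authors' implementation. The function names, the toy data, the choice k=3, and the majority-vote ENN rule are all illustrative assumptions, and the sketch assumes at least two minority samples. In practice the `SMOTEENN` class from the imbalanced-learn package would typically be used instead.

```python
import random
from collections import Counter


def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(idx, points, k):
    """Indices of the k points closest to points[idx], excluding idx itself."""
    others = [j for j in range(len(points)) if j != idx]
    others.sort(key=lambda j: sq_dist(points[idx], points[j]))
    return others[:k]


def smote(minority, n_new, k=3, rng=None):
    """SMOTE step: build n_new synthetic points, each interpolated between
    a random minority sample and one of its k nearest minority neighbours."""
    rng = rng or random.Random(0)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        j = rng.choice(nearest(i, minority, k))
        gap = rng.random()  # position along the segment, in [0, 1)
        synthetic.append(
            tuple(a + gap * (b - a) for a, b in zip(minority[i], minority[j]))
        )
    return synthetic


def enn(X, y, k=3):
    """ENN step (majority-vote variant, an assumption of this sketch): drop
    every sample whose label differs from the majority label of its k
    nearest neighbours."""
    keep = []
    for i in range(len(X)):
        votes = Counter(y[j] for j in nearest(i, X, k))
        if votes.most_common(1)[0][0] == y[i]:
            keep.append(i)
    return [X[i] for i in keep], [y[i] for i in keep]


def smote_enn(X, y, minority_label=1, k=3, seed=0):
    """Oversample the minority class up to the majority count, then clean
    the combined set with ENN."""
    minority = [x for x, label in zip(X, y) if label == minority_label]
    n_new = max(len(X) - 2 * len(minority), 0)
    new_points = smote(minority, n_new, k, random.Random(seed))
    X_over = list(X) + new_points
    y_over = list(y) + [minority_label] * len(new_points)
    return enn(X_over, y_over, k)


# Hypothetical toy data: 10 majority points (label 0), 3 minority points (label 1).
X = [(0.0, i * 0.1) for i in range(10)] + [(100.0, 0.0), (100.0, 1.0), (101.0, 0.0)]
y = [0] * 10 + [1] * 3
X_res, y_res = smote_enn(X, y)
print("before:", Counter(y), "after:", Counter(y_res))
```

Because the two toy classes are far apart, ENN removes nothing here and the resampled set is balanced. On overlapping real data, ENN would also discard borderline or noisy samples from either class, so the final class counts need not be equal.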
Authors
HAN Zhao-yi (韩朝怡)
LIAN Gao-she (连高社)
Department of Science, Taiyuan Institute of Technology, Taiyuan, Shanxi 030008
Source
Journal of Shanxi Datong University (Natural Science Edition)
2023, No. 3, pp. 31-35 (5 pages)
Funding
Taiyuan Institute of Technology Institute-level Fund Project [2020LG06].