Abstract
Objective: To predict whether patients with newly diagnosed diffuse large B-cell lymphoma (DLBCL) treated in the hematology department of a tertiary grade-A hospital in Shanxi Province between 2011 and 2017 achieve two-year event-free survival, that is, to predict early recurrence in DLBCL patients.
Methods: Patients were divided into early-recurrence and non-early-recurrence groups according to event-free survival time, and these labels were used to build classification models. The data were first normalized, and features were then selected with LASSO. Because the classes were imbalanced, four oversampling methods were each used to balance the data: SMOTE (synthetic minority over-sampling technique), Borderline-1 SMOTE, Borderline-2 SMOTE, and ADASYN (adaptive synthetic sampling). A multi-kernel model based on the support vector machine was then built as the final classifier and compared with AdaBoost, random forest, and single-kernel support vector machines using a Gaussian kernel or a polynomial kernel, in order to predict early recurrence in newly diagnosed cases.
Results: Among all models in this study, the multi-kernel model with LASSO plus Borderline-1 SMOTE (accuracy=0.87, precision=0.87, recall=0.87, f1=0.87, AUC=0.87) achieved the best classification performance. Two ensemble models, random forest with SMOTE (accuracy=0.84, precision=0.85, recall=0.87, f1=0.79, AUC=0.83) and random forest with Borderline-2 SMOTE (accuracy=0.84, precision=0.85, recall=0.87, f1=0.79, AUC=0.83), also performed well, but both fell short of the multi-kernel support vector machine. The two single-kernel support vector machines performed poorly.
Conclusion: Among all models built in this study, the multi-kernel support vector machine with LASSO feature selection and Borderline-1 SMOTE resampling performed best and can serve as a reference for predicting early recurrence of DLBCL.
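The two less familiar steps in the Methods, oversampling the minority class and combining kernels, can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say how the kernels are combined or which hyperparameters were used, so the equal-weight sum, `gamma`, `degree`, `coef0`, `k`, and the toy data below are all assumptions. The oversampler is plain SMOTE, not the Borderline-1 variant that performed best.

```python
import math
import random


def rbf_kernel(x, y, gamma=0.5):
    """Gaussian (RBF) kernel: exp(-gamma * ||x - y||^2). gamma is an assumed value."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)


def poly_kernel(x, y, degree=2, coef0=1.0):
    """Polynomial kernel: (<x, y> + coef0) ** degree. degree and coef0 are assumed."""
    dot = sum(a * b for a, b in zip(x, y))
    return (dot + coef0) ** degree


def multi_kernel(x, y, weight=0.5):
    """Convex combination of the two kernels (an assumed combination rule)."""
    return weight * rbf_kernel(x, y) + (1.0 - weight) * poly_kernel(x, y)


def gram_matrix(samples, kernel):
    """Pairwise kernel values for all samples."""
    return [[kernel(a, b) for b in samples] for a in samples]


def smote(minority_samples, n_new, k=3, seed=0):
    """Plain SMOTE: interpolate between a minority sample and one of its
    k nearest minority neighbours."""
    rng = random.Random(seed)
    new_points = []
    for _ in range(n_new):
        base = rng.choice(minority_samples)
        neighbours = sorted(
            (p for p in minority_samples if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        other = rng.choice(neighbours)
        t = rng.random()  # position along the segment from base to other
        new_points.append([b + t * (o - b) for b, o in zip(base, other)])
    return new_points


# Hypothetical, already-normalized minority-class feature vectors.
minority = [[0.10, 0.20], [0.20, 0.10], [0.15, 0.30], [0.30, 0.25]]
synthetic = smote(minority, n_new=4)
balanced = minority + synthetic
gram = gram_matrix(balanced, multi_kernel)
print(len(balanced), len(gram), len(gram[0]))
```

A Gram matrix of this kind could be handed to an SVM implementation that accepts precomputed kernels, for example scikit-learn's `SVC(kernel="precomputed")`. A weighted sum of valid kernels with non-negative weights is itself a valid kernel, which is why this combination is a common starting point for multiple kernel learning.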
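Authors
Xing Meng (邢蒙)
Zhou Jie (周洁)
Yu Hongmei (余红梅)
Zhang Yanbo (张岩波)
Yang Zhenhuan (阳桢寰)
Zhao Yanlin (赵艳琳)
Li Xueling (李雪玲)
Li Qiong (李琼)
Zhao Zhiqiang (赵志强)
Luo Yanhong (罗艳虹)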
Xing Meng; Zhou Jie; Yu Hongmei (Department of Health Statistics, School of Public Health, Shanxi Medical University, Taiyuan 030001)
Source
Chinese Journal of Health Statistics (《中国卫生统计》)
CSCD
PKU Core Journal (北大核心)
2022, Issue 4, pp. 518-521, 528 (5 pages)
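Funding
Shanxi Provincial Department of Science and Technology, Applied Basic Research Program, General Project (202103021224245)
National Natural Science Foundation of China, Young Scientists Fund (81502897)
Shanxi Medical University Doctoral Start-up Fund (BS2017029)
National Natural Science Foundation of China, General Program (81973154)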
Keywords
Diffuse large B-cell lymphoma
Early recurrence
Multiple kernel learning
Imbalanced data