Abstract
Just-in-time software defect prediction is a necessary means of ensuring both software security and quality, and it has received growing attention in software engineering. However, existing datasets suffer from feature redundancy and low feature correlation, which greatly degrade the classification performance and stability of just-in-time defect prediction models. In addition, analyzing how the features of defect data influence a model is particularly important, yet few studies have examined the interpretability of software defect prediction models. To address these problems, this paper draws on a large-scale empirical study of 227,417 code-level changes from six open-source projects and combines SHAP, SMOTEENN, and XGBoost (SHAP-SEBoost) to build a just-in-time software defect prediction model. First, the SHAP (SHapley Additive exPlanation) model explainer is used to analyze the features of the initial dataset, and features are selected and combined according to the results of that analysis. Then, SMOTEENN is used to balance the positive and negative samples of the class-imbalanced defect data, and the ensemble learning algorithm XGBoost is used to build the prediction model on the experimental data. Finally, SHAP is used to analyze the interpretability of the resulting model. Experimental results show that SHAP-SEBoost effectively improves classification performance: compared with baseline models and strong recent models, AUC increases by 11.6% on average and F1 by 33.5% on average.
Authors
陈丽琼
王璨
宋士龙
CHEN Li-qiong; WANG Can; SONG Shi-long (Department of Computer Science and Information Engineering, Shanghai Institute of Technology, Shanghai 201418, China)
Source
《小型微型计算机系统》
CSCD
Peking University Core Journal (北大核心)
2022, Issue 4, pp. 865-871 (7 pages)
Journal of Chinese Computer Systems
Funding
Supported by the National Natural Science Foundation of China (61702334).
Keywords
just-in-time software defect prediction
model interpretability
feature engineering
ensemble learning