Abstract
To address the difficulty of obtaining labeled samples and the class imbalance problem in software defect prediction, this paper proposes Tri_Adaboost, a software defect prediction model based on semi-supervised ensemble learning. On the one hand, under-sampling and semi-supervised learning are used to expand the labeled sample set: a portion of the unlabeled samples is randomly selected and pre-labeled, which alleviates the shortage of labeled samples. On the other hand, the SMOTE method is used to resample the expanded labeled sample set, and the AdaBoost ensemble method is then applied to that set for prediction. The model is validated on the NASA MDP datasets and on a null pointer dereference defect dataset generated from open-source projects. Compared with four basic machine learning classification methods, Tri_Adaboost achieves higher F-measure and AUC values.
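As an illustration of the SMOTE resampling step named in the abstract, the following is a minimal pure-Python sketch, not the authors' implementation. The function name `smote`, the neighbour count `k`, the random seed, and the toy feature vectors are all assumptions made for this example; the abstract does not state the parameters actually used.

```python
import random

def smote(minority, n_new, k=3, seed=0):
    """Create n_new synthetic minority samples (hypothetical helper).

    Each synthetic point lies on the line segment between a randomly
    chosen minority sample and one of its k nearest minority neighbours.
    Requires at least two minority samples.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # All other minority samples, sorted by squared Euclidean distance to base.
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbour)])
    return synthetic

# Toy minority class (e.g. defective modules) with two assumed metric features.
defective = [[1.0, 2.0], [1.5, 1.8], [2.0, 2.2], [1.2, 2.5]]
new_points = smote(defective, n_new=6)
balanced_minority = defective + new_points
```

Because every synthetic sample is an interpolation between two existing minority samples, the new points stay inside the region already occupied by the minority class, which is what distinguishes SMOTE from simply duplicating samples.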
Authors
ZHANG Xiao (张肖)
WANG Li-ming (王黎明)
School of Information Engineering, Zhengzhou University, Zhengzhou 450001, China
Source
Journal of Chinese Computer Systems (《小型微型计算机系统》)
Indexed in: CSCD; Peking University Core Journals (北大核心)
2018, No. 10, pp. 2138-2145 (8 pages)