Abstract
Traditional feature selection methods consider only linear relationships between variables and ignore nonlinear correlation, which lowers the performance of models that predict the number of software defects. To address this, a feature selection method based on the maximum information coefficient (MIC) is proposed. The method accounts for both linear and nonlinear relationships among features and between features and the number of defects, and it separates redundancy analysis and relevance analysis into two phases. In the redundancy analysis phase, an agglomerative hierarchical clustering algorithm, driven by the correlation between features, groups redundant features into the same cluster. In the relevance analysis phase, the features in each cluster are ranked in descending order of their correlation with the number of software defects, and the top-ranked features of each cluster are selected to form the feature subset. Experimental results show that the method selects effective feature subsets and improves the prediction performance of software defect number prediction models by removing redundant and irrelevant features.
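The two-phase procedure described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `binned_dependence` is a hypothetical stand-in for MIC (normalized mutual information on a fixed equal-width grid, whereas MIC searches over many grids), and the average linkage, the `threshold` stopping rule, and the `top_k` parameter are illustrative choices that the abstract does not specify. The toy feature names are also invented.

```python
import math
from collections import Counter


def binned_dependence(x, y, bins=4):
    """Stand-in for MIC (assumption): mutual information on a fixed
    equal-width grid, normalized to [0, 1]. Real MIC maximizes over grids."""
    def discretize(values):
        lo, hi = min(values), max(values)
        if hi == lo:
            return [0] * len(values)
        return [min(int((v - lo) / (hi - lo) * bins), bins - 1) for v in values]

    bx, by = discretize(x), discretize(y)
    n = len(x)
    pxy, px, py = Counter(zip(bx, by)), Counter(bx), Counter(by)
    mi = sum(c / n * math.log(c * n / (px[i] * py[j]))
             for (i, j), c in pxy.items())
    k = min(len(px), len(py))  # number of occupied bins on the coarser axis
    return mi / math.log(k) if k > 1 else 0.0


def cluster_features(features, dep, threshold=0.5):
    """Phase 1 (redundancy): average-linkage agglomerative clustering.
    Merging stops when the closest pair of clusters has mean dependence
    below `threshold` (an assumed stopping rule)."""
    names = list(features)
    sim = {(a, b): dep(features[a], features[b])
           for a in names for b in names if a != b}
    clusters = [[name] for name in names]
    while len(clusters) > 1:
        best, pair = -1.0, None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                total = sum(sim[a, b] for a in clusters[i] for b in clusters[j])
                mean = total / (len(clusters[i]) * len(clusters[j]))
                if mean > best:
                    best, pair = mean, (i, j)
        if best < threshold:
            break
        i, j = pair
        merged = clusters.pop(j)  # j > i, so index i stays valid
        clusters[i].extend(merged)
    return clusters


def select_features(features, target, dep=binned_dependence,
                    threshold=0.5, top_k=1):
    """Phase 2 (relevance): rank each cluster's features by dependence on
    the defect count and keep the top_k from every cluster."""
    selected = []
    for cluster in cluster_features(features, dep, threshold):
        ranked = sorted(cluster, key=lambda f: dep(features[f], target),
                        reverse=True)
        selected.extend(ranked[:top_k])
    return selected


# Toy data (invented): `stmts` is a linear copy of `loc`, `noise` is
# unrelated to `loc`, and `defects` depends on `loc` nonlinearly.
loc = list(range(16))
stmts = [2 * v + 1 for v in loc]
noise = [v % 4 for v in loc]
defects = [(v - 7.5) ** 2 for v in loc]
features = {"loc": loc, "stmts": stmts, "noise": noise}

selected = select_features(features, defects)
print(selected)
```

In this toy case `loc` and `stmts` fall into one cluster, so only one of them survives. The defect count is a symmetric parabola in `loc`, so a purely linear correlation would be zero, while the grid-based measure still detects the dependence. A genuine MIC estimator could be passed in as `dep` in place of the stand-in.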
Authors
LIU Guoqing (刘国庆)
WANG Xingqi (王兴起)
WEI Dan (魏丹)
FANG Jinglong (方景龙)
SHAO Yanli (邵艳利)
School of Computer Science and Technology, Hangzhou Dianzi University, Hangzhou 310018, China
Source
Telecommunications Science (《电信科学》)
2021, Issue 5, pp. 133-147 (15 pages)
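Funding
Natural Science Foundation of Zhejiang Province (No. LY20F020015, No. LY21F020015)
National Natural Science Foundation of China (No. 61702517, No. 61972121, No. 61702146)
National Defense Basic Scientific Research Program (No. JCKY2019415C001)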
Keywords
software defect number prediction
feature selection
maximum information coefficient