
Detection Method of Duplicate Defect Reports Fusing Text and Categorization Information (cited by: 9)
Abstract Software defects are the root cause of software errors and failures. They arise from factors such as unreasonable requirements analysis, imprecise programming languages, and inexperienced developers. Because defects are unavoidable, submitting defect reports is an important way to discover and fix them. A defect report is the carrier that describes a defect, and resolving defect reports is a necessary means of improving software. Maintainers and users often submit reports for the same defect, which leaves a large number of redundant reports in the defect report repository, and manual triage can no longer keep up with increasingly complex software systems. Duplicate defect report detection filters these redundant duplicates out of the repository so that effort and time can be spent on new defect reports. The prediction accuracy of existing methods remains low; the difficulty lies in finding a suitable and comprehensive way to measure the similarity between defect reports. Borrowing the idea of ensemble methods, this paper proposes BSO (combination of BM25F, LSI and One-Hot), a duplicate defect report detection method that fuses text information and categorization information. After data preprocessing, defect reports are split into a text information domain and a categorization information domain. In the text information domain, the BM25F and LSI algorithms each produce a similarity score, and a similarity fusion method combines the two scores; in the categorization information domain, the One-Hot algorithm produces a similarity score. The similarity fusion method then merges the scores of the text and categorization domains, a recommendation list of candidate duplicates is produced for each defect report, and the accuracy of duplicate defect report detection is computed. The method was implemented in Python and compared on the public OpenOffice dataset with a baseline method and the state-of-the-art methods REP and DBTM. The experimental results show that the accuracy of the proposed method is on average 4.7% higher than that of DBTM and 6.3% higher than that of REP, and that the improvement over the baseline method is larger still. These results demonstrate the effectiveness of BSO.
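The scoring and fusion steps described in the abstract can be illustrated with a minimal Python sketch. The abstract does not give the fusion formula, so the min-max normalization, the weighted sum, the weights, and the top-k hit rate used here are assumptions chosen for illustration, not the authors' formulation. The BM25F and LSI scores are hard-coded placeholders rather than computed values, and all function names, field names, and report IDs are hypothetical.

```python
from typing import Dict, List, Sequence, Tuple


def one_hot(report: Dict[str, str], vocab: List[Tuple[str, str]]) -> List[int]:
    """Encode categorical fields as a 0/1 vector over (field, value) pairs."""
    return [1 if report.get(field) == value else 0 for field, value in vocab]


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = (sum(x * x for x in a) ** 0.5) * (sum(y * y for y in b) ** 0.5)
    return dot / norm if norm else 0.0


def min_max(scores: Sequence[float]) -> List[float]:
    """Rescale scores to [0, 1] so different methods become comparable."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]


def fuse(score_lists: Sequence[Sequence[float]], weights: Sequence[float]) -> List[float]:
    """Assumed fusion rule: weighted sum of min-max normalized scores."""
    normalized = [min_max(s) for s in score_lists]
    n = len(score_lists[0])
    return [sum(w * col[i] for w, col in zip(weights, normalized)) for i in range(n)]


def recommend(candidate_ids: Sequence[str], fused: Sequence[float], k: int) -> List[str]:
    """Return the k candidates with the highest fused similarity."""
    ranked = sorted(zip(candidate_ids, fused), key=lambda p: p[1], reverse=True)
    return [cid for cid, _ in ranked[:k]]


def hit_rate(recs: Dict[str, List[str]], truth: Dict[str, str]) -> float:
    """Fraction of queries whose known duplicate appears in the top-k list."""
    hits = sum(1 for q, dup in truth.items() if dup in recs.get(q, []))
    return hits / len(truth)


# Hypothetical categorical fields of one new report and three stored reports.
query = {"component": "writer", "priority": "P3"}
candidates = {
    "r1": {"component": "writer", "priority": "P3"},
    "r2": {"component": "calc", "priority": "P3"},
    "r3": {"component": "calc", "priority": "P1"},
}
ids = list(candidates)
vocab = sorted({(f, v) for r in [query, *candidates.values()] for f, v in r.items()})
cat_scores = [cosine(one_hot(query, vocab), one_hot(candidates[i], vocab)) for i in ids]

# Placeholder text-domain scores (a real system would compute BM25F and LSI).
bm25f_scores = [7.2, 3.1, 0.4]
lsi_scores = [0.81, 0.40, 0.65]

text_scores = fuse([bm25f_scores, lsi_scores], weights=[0.5, 0.5])
final_scores = fuse([text_scores, cat_scores], weights=[0.7, 0.3])
top = recommend(ids, final_scores, k=2)
print(top)
```

The two-stage call to `fuse` mirrors the order given in the abstract: the two text scores are combined first, and the result is then combined with the categorical score. In practice the weights would be tuned on held-out data rather than fixed by hand.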
Authors FAN Dao-yuan (范道远); SUN Ji-hong (孙吉红); WANG Wei (王炜); TU Ji-ping (涂吉屏); HE Xin (何欣) (College of Software, Yunnan University, Kunming 650500, China; Academy of Sciences in Yunnan Province, Kunming 650091, China; Key Laboratory for Software Engineering of Yunnan Province, Kunming 650500, China)
Source Computer Science (《计算机科学》), indexed in CSCD and the Peking University Core Journals list, 2019, No. 12, pp. 192-200 (9 pages)
Keywords Duplicate defect report; Information retrieval method; Topic model; One-Hot; Similarity fusion
