Abstract
As software grows in scale and complexity, defects of many kinds are inevitable. Security-related defects are easily exploited by attackers and can cause serious economic loss and harm to life and property. During development and maintenance, bug tracking systems such as Bugzilla record and track defects in the form of bug reports. Automatically identifying security bug reports (SBRs) quickly picks out the security-related reports in a bug report repository, helping developers find security bugs promptly and fix them first. Current automatic SBR identification methods mainly combine text mining with machine learning. However, because security-related bugs have complex characteristics and are scarce in real projects, traditional machine-learning models struggle to extract deep security-related semantic features from the textual fields of bug reports. In addition, previous approaches filter noisy bug reports from datasets with text mining models that ignore semantic information, so model training remains strongly affected by dataset noise. Together these factors create a bottleneck for further improving generalization performance. To address these problems, this paper proposes an SBR identification framework that combines semantic-based noise filtering with deep learning. The framework first uses word embedding to obtain dense, low-dimensional distributed vector representations of all words in the corpus. It then applies the proposed generative-model-based noise filtering method FSDON (Filtering Semantically Deviating Outlier NSBRs) to filter out non-security bug reports (NSBRs) that are semantically similar to SBRs and are likely to be noise. Finally, it builds SBR identification models with different deep neural networks (LSTM, GRU, TextCNN, and Multi-scale DCNN). The method is evaluated on 5 datasets of different sizes. The experimental results show that, compared with the state-of-the-art methods that combine text mining and machine learning, it improves the g-measure by 8.26% on average and outperforms them on datasets of every size.
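The abstract reports results in terms of g-measure but does not define it. As a minimal sketch, assuming the definition commonly used in security bug report prediction studies (the harmonic mean of recall, pd, and one minus the false-alarm rate, pf), the metric could be computed as follows. The function name `g_measure` and the confusion-matrix counts are illustrative and are not taken from the paper.

```python
def g_measure(tp: int, fn: int, fp: int, tn: int) -> float:
    """Harmonic mean of recall (pd) and 1 - false-alarm rate (pf).

    Assumed definition; the abstract itself does not spell it out.
    tp/fn count security bug reports, fp/tn count non-security ones.
    """
    pd = tp / (tp + fn) if (tp + fn) else 0.0   # recall on SBRs
    pf = fp / (fp + tn) if (fp + tn) else 0.0   # false alarms on NSBRs
    specificity = 1.0 - pf
    if pd + specificity == 0:
        return 0.0
    return 2 * pd * specificity / (pd + specificity)


# Hypothetical counts: 8 of 10 SBRs found, 10 of 100 NSBRs misflagged.
print(round(g_measure(tp=8, fn=2, fp=10, tn=90), 4))  # 0.8471
```

Under this definition a model cannot score well by simply labeling everything as non-security, which is useful when SBRs are scarce.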
Authors
蒋远
牟辰光
苏小红
王甜甜
JIANG Yuan; MU Chen-Guang; SU Xiao-Hong; WANG Tian-Tian (Faculty of Computing, Harbin Institute of Technology, Harbin 150001)
Source
《计算机学报》
EI
CAS
CSCD
PKU Core (Peking University Core Journals)
2022, No. 8, pp. 1794-1813 (20 pages)
Chinese Journal of Computers
Funding
National Natural Science Foundation of China (61672191)
National Key Research and Development Program of China, 13th Five-Year Plan (2017YFC0702204)
Keywords
security bug report detection
generative model
noise filtering of bug reports
deep learning