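To address noise, small disjuncts, and within-class and between-class imbalance in imbalanced datasets, an adaptive oversampling technique based on HDBSCAN (hierarchical density-based spatial clustering of applications with noise) clustering is proposed. The technique oversamples only the arbitrarily shaped clusters discovered by HDBSCAN: it adaptively synthesizes more samples in sparser clusters and relatively fewer samples in denser clusters, and the synthesized samples lie close to the cluster centers. Experimental results show that the method effectively avoids generating noise in imbalanced datasets while overcoming both between-class and within-class imbalance, providing an oversampling strategy for imbalanced learning.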
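The allocation idea in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not define "sparsity" or the synthesis rule, so the sketch assumes sparsity is the mean distance of a cluster's points to its centroid, and that each synthetic sample is interpolated between a randomly chosen cluster member and the centroid. Cluster labels are assumed to come from an HDBSCAN run performed elsewhere, with `-1` marking noise, which is the usual HDBSCAN convention. All function names are hypothetical.

```python
import math
import random


def centroid(points):
    """Component-wise mean of a list of equal-length points."""
    return [sum(c) / len(points) for c in zip(*points)]


def sparsity(points):
    """Assumed sparsity measure: mean Euclidean distance to the centroid."""
    c = centroid(points)
    return sum(math.dist(p, c) for p in points) / len(points)


def allocate(sparsities, n_total):
    """Split n_total synthetic samples across clusters in proportion to sparsity."""
    total = sum(sparsities.values())
    if total == 0:
        # All clusters are degenerate: fall back to an equal split.
        raw = {k: n_total / len(sparsities) for k in sparsities}
    else:
        raw = {k: n_total * s / total for k, s in sparsities.items()}
    counts = {k: int(v) for k, v in raw.items()}
    # Hand out the samples lost to rounding, largest remainder first.
    leftover = n_total - sum(counts.values())
    by_remainder = sorted(raw, key=lambda k: raw[k] - counts[k], reverse=True)
    for k in by_remainder[:leftover]:
        counts[k] += 1
    return counts


def oversample(minority, labels, n_total, seed=0):
    """Synthesize n_total minority samples, more of them in sparser clusters."""
    rng = random.Random(seed)
    clusters = {}
    for point, label in zip(minority, labels):
        if label != -1:  # noise points are never used for synthesis
            clusters.setdefault(label, []).append(point)
    counts = allocate({k: sparsity(v) for k, v in clusters.items()}, n_total)
    synthetic = []
    for k, points in clusters.items():
        c = centroid(points)
        for _ in range(counts[k]):
            base = rng.choice(points)
            t = rng.random()  # t = 0 keeps the member, t close to 1 approaches the centroid
            synthetic.append([b + t * (ci - b) for b, ci in zip(base, c)])
    return synthetic, counts
```

Because every synthetic point lies on a segment between a cluster member and that cluster's centroid, no sample is generated around a noise point or in the gap between clusters, which is the behaviour the abstract attributes to the method.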
In practice, some bugs have more impact than others and thus deserve more immediate attention. Because of tight schedules and limited human resources, developers may not have enough time to inspect all bugs, so they often concentrate on those that are highly impactful. In the literature, high-impact bugs refer to bugs that appear at unexpected times or locations and bring more unexpected effects (i.e., surprise bugs), or that break pre-existing functionality and destroy the user experience (i.e., breakage bugs). Unfortunately, identifying high-impact bugs among thousands of bug reports in a bug tracking system is not an easy feat. An automated technique that identifies high-impact bug reports can therefore help developers become aware of them early, rectify them quickly, and minimize the damage they cause. Because only a small proportion of bugs are high-impact, identifying high-impact bug reports is a difficult task. In this paper, we propose an approach that identifies high-impact bug reports by leveraging imbalanced learning strategies. We investigate the effectiveness of various variants, each of which combines one imbalanced learning strategy with one classification algorithm. In particular, we choose four widely used strategies for dealing with imbalanced data and four state-of-the-art text classification algorithms, and conduct experiments on four datasets from four different open source projects. We mainly perform an analytical study on two types of high-impact bugs, i.e., surprise bugs and breakage bugs. The results show that different variants perform differently, and that the best-performing variants, SMOTE (synthetic minority over-sampling technique) + KNN (K-nearest neighbours) for surprise bug identification and RUS (random under-sampling) + NB (naive Bayes) for breakage bug identification, achieve higher F1-scores than the two state-of-the-art approaches by Thung et al. and Garcia and Shihab.
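To make the notion of a "variant" concrete, the following sketch pairs a resampling strategy with a classifier on toy numeric data. It is an illustration under stated assumptions, not the pipeline from the paper: bug reports are assumed to be already converted to numeric feature vectors, label `1` is assumed to mark the minority (high-impact) class, naive Bayes is omitted for brevity, and all function names are hypothetical. SMOTE is written in its textbook form (interpolation between a minority sample and one of its k nearest minority neighbours).

```python
import math
import random


def split(X, y):
    """Separate minority (label 1) and majority (label 0) samples."""
    pos = [x for x, t in zip(X, y) if t == 1]
    neg = [x for x, t in zip(X, y) if t == 0]
    return pos, neg


def smote(minority, n_new, k=3, seed=0):
    """Create n_new points by interpolating towards nearest minority neighbours."""
    if len(minority) < 2:
        raise ValueError("SMOTE needs at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        x = minority[i]
        others = [p for j, p in enumerate(minority) if j != i]
        neighbours = sorted(others, key=lambda p: math.dist(p, x))[:k]
        n = rng.choice(neighbours)
        u = rng.random()
        synthetic.append(tuple(a + u * (b - a) for a, b in zip(x, n)))
    return synthetic


def smote_balance(X, y, k=3, seed=0):
    """Over-sample the minority class until both classes have equal size."""
    pos, neg = split(X, y)
    pos = pos + smote(pos, len(neg) - len(pos), k, seed)
    return pos + neg, [1] * len(pos) + [0] * len(neg)


def rus_balance(X, y, seed=0):
    """Random under-sampling: keep a random majority subset of minority size."""
    pos, neg = split(X, y)
    rng = random.Random(seed)
    neg = rng.sample(neg, min(len(neg), len(pos)))
    return pos + neg, [1] * len(pos) + [0] * len(neg)


def knn_predict(X_train, y_train, x, k=3):
    """Majority vote among the k training samples nearest to x."""
    order = sorted(range(len(X_train)), key=lambda i: math.dist(X_train[i], x))
    votes = [y_train[i] for i in order[:k]]
    return 1 if 2 * sum(votes) > len(votes) else 0


def f1_score(y_true, y_pred):
    """F1 of the positive class: 2*TP / (2*TP + FP + FN)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0
```

A "SMOTE + KNN" variant in this sketch means calling `smote_balance` on the training data and then `knn_predict` on each test sample. Resampling is applied to the training split only, so that the F1-score is measured on the original class distribution.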
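Most current multi-class learning methods convert a multi-class problem into binary problems, which not only incurs a high time cost but also leaves blind regions in which samples cannot be recognized. This work proposes multi-SVDD, an algorithm that performs multi-class learning directly. Taking into account the within-class imbalance found in large-sample, multi-class data, the algorithm first clusters the training samples of each class and then, based on the clustering results, uses support vector data description (SVDD) to build multiple minimum enclosing balls. The distance from a test sample to the minimum enclosing balls built by SVDD determines which cluster the sample belongs to, and hence which class it belongs to. Compared with the minimum enclosing ball method, multi-SVDD shows no significant increase in time and space overhead while achieving better experimental results.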
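The cluster-then-enclose structure of this abstract can be sketched as follows. This is a simplified stand-in, not multi-SVDD itself: real SVDD solves a quadratic program (usually with a kernel) to obtain each ball, whereas this sketch approximates every ball by a cluster's centroid and the largest member distance. The abstract does not name the clustering algorithm or the exact distance rule, so a small deterministic k-means and the signed distance to the ball surface are assumptions here, and all function names are hypothetical.

```python
import math


def centroid(points):
    """Component-wise mean of a list of equal-length points."""
    return tuple(sum(c) / len(points) for c in zip(*points))


def kmeans(points, k, iters=20):
    """Lloyd's algorithm with deterministic farthest-point initialisation."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(math.dist(p, c) for c in centers)))
    groups = [list(points)]
    for _ in range(iters):
        groups = [[] for _ in centers]
        for p in points:
            j = min(range(len(centers)), key=lambda i: math.dist(p, centers[i]))
            groups[j].append(p)
        groups = [g for g in groups if g]  # drop clusters that lost all points
        centers = [centroid(g) for g in groups]
    return groups


def fit_balls(X, y, k=2):
    """Cluster each class separately and enclose every cluster in one ball."""
    balls = []  # entries are (center, radius, class_label)
    for label in sorted(set(y)):
        members = [x for x, t in zip(X, y) if t == label]
        for group in kmeans(members, min(k, len(members))):
            c = centroid(group)
            r = max(math.dist(p, c) for p in group)  # approximation, not true SVDD
            balls.append((c, r, label))
    return balls


def predict(balls, x):
    """Return the class of the ball whose surface is nearest to x.

    The score is negative when x lies inside a ball and positive outside.
    """
    center, radius, label = min(balls, key=lambda b: math.dist(x, b[0]) - b[1])
    return label
```

Because every test sample is assigned to its nearest ball, no region of the input space is left undecided, which corresponds to the blind-region problem the abstract raises against binary decompositions. Using several balls per class lets a class made of separated sub-clusters be described without one large ball that also covers the space between them.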
Funding: This work is supported by the National Natural Science Foundation of China under Grant Nos. 61602403 and 61402406 and the National Key Technology Research and Development Program of the Ministry of Science and Technology of China under Grant No. 2015BAH17F01.