Abstract
With the rapid development of machine learning, more and more machine learning algorithms are being applied to the detection and analysis of attack traffic. However, attack traffic usually accounts for only a very small fraction of network traffic, so the training set suffers from an imbalance between positive and negative samples, which degrades model training. To address this imbalance problem, this paper proposes an imbalanced-sample generation method based on a variational auto-encoder (VAE). Its core idea is that, instead of expanding all minority samples, the minority class is first analyzed and only the boundary minority samples that are most likely to confuse a machine learning model are expanded. First, the KNN algorithm is used to select the minority samples that are closest to the majority class. Second, the DBSCAN algorithm clusters these selected samples into one or more sub-clusters. Then, a variational auto-encoder network is designed to learn from and expand the minority samples in each sub-cluster, and the generated samples are added to the original samples to build a new training set. Finally, the new training set is used to train a decision tree classifier for abnormal traffic detection. Recall and F1 score are chosen as evaluation metrics, and comparative experiments are conducted with training sets built from the original samples, SMOTE-generated samples, samples generated by an improved SMOTE method, and samples generated by the proposed method. The experimental results show that, for the four anomaly types, the decision tree classifier trained on the training set constructed by the proposed method improves both recall and F1 score, with the F1 score improved by up to 20.9% over the original samples and the SMOTE method.
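To make the pipeline described in the abstract concrete, the following is a minimal Python sketch of the described steps (KNN boundary selection, DBSCAN sub-clustering, per-sub-cluster VAE generation, decision tree training). It is not the authors' implementation: the function names, network sizes, and hyper-parameters (k, eps, min_samples, n_per_cluster, the 64-unit hidden layer, the 8-dimensional latent space) are illustrative assumptions, and scikit-learn and PyTorch are used only as convenient stand-in libraries.

# Illustrative sketch (not the paper's code) of the boundary-sample
# oversampling pipeline, assuming tabular flow features in NumPy arrays
# X (features) and y (labels, 1 = minority/attack class, 0 = majority).
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
from sklearn.neighbors import NearestNeighbors
from sklearn.cluster import DBSCAN
from sklearn.tree import DecisionTreeClassifier

def select_boundary_minority(X, y, k=5):
    """Keep minority samples whose k nearest neighbours include majority points
    (one possible reading of 'minority samples closest to the majority class')."""
    nn_index = NearestNeighbors(n_neighbors=k + 1).fit(X)
    _, idx = nn_index.kneighbors(X[y == 1])
    neighbour_labels = y[idx[:, 1:]]               # drop the query point itself
    is_boundary = (neighbour_labels == 0).any(axis=1)
    return X[y == 1][is_boundary]

class VAE(nn.Module):
    """Minimal fully connected VAE; layer sizes are illustrative assumptions."""
    def __init__(self, n_features, latent_dim=8):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(n_features, 64), nn.ReLU())
        self.mu, self.logvar = nn.Linear(64, latent_dim), nn.Linear(64, latent_dim)
        self.dec = nn.Sequential(nn.Linear(latent_dim, 64), nn.ReLU(),
                                 nn.Linear(64, n_features))

    def forward(self, x):
        h = self.enc(x)
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterization
        return self.dec(z), mu, logvar

def fit_and_sample(cluster, n_new, epochs=200):
    """Train a VAE on one DBSCAN sub-cluster, then sample synthetic points."""
    model = VAE(cluster.shape[1])
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    data = torch.tensor(cluster, dtype=torch.float32)
    for _ in range(epochs):
        recon, mu, logvar = model(data)
        kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
        loss = F.mse_loss(recon, data, reduction="sum") + kl  # standard VAE loss
        opt.zero_grad()
        loss.backward()
        opt.step()
    with torch.no_grad():
        z = torch.randn(n_new, model.mu.out_features)
        return model.dec(z).numpy()

def augment_training_set(X, y, k=5, eps=0.5, min_samples=5, n_per_cluster=100):
    """Build the new training set: original samples plus VAE-generated minority samples."""
    boundary = select_boundary_minority(X, y, k)
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(boundary)
    synthetic = [fit_and_sample(boundary[labels == c], n_per_cluster)
                 for c in set(labels) if c != -1]  # skip DBSCAN noise points
    if synthetic:
        X = np.vstack([X] + synthetic)
        y = np.concatenate([y, np.ones(sum(len(s) for s in synthetic))])
    return X, y

# Downstream: train the decision tree on the augmented set, e.g.
# X_aug, y_aug = augment_training_set(X_train, y_train)
# clf = DecisionTreeClassifier().fit(X_aug, y_aug)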
Authors
张仁杰
陈伟
杭梦鑫
吴礼发
ZHANG Ren-jie; CHEN Wei; HANG Meng-xin; WU Li-fa (School of Computer Science, Nanjing University of Posts and Telecommunications, Nanjing 210023, China)
Source
《计算机科学》
CSCD
Peking University Core Journal (北大核心)
2021, No. 7, pp. 62-69 (8 pages)
Computer Science
Funding
National Key R&D Program of China (2019YFB2101704).