Abstract
When classifying multi-class imbalanced Internet traffic, classification models based on machine learning algorithms are biased toward the majority classes, which leads to low recall for the minority classes. To address this problem, a feature selection method based on statistical frequency is proposed. The method first computes, from the statistical frequency of the samples, a feature selection coefficient that measures the discriminating ability of each feature. It then constructs a feature selection matrix from these coefficients. Finally, it selects for each class the features that are strongly correlated with that class. In the experiments, classifying multi-class imbalanced Internet traffic with the features selected by this method achieved relatively high overall accuracy, minority-class recall, and g-mean, showing that the method can mitigate the adverse effects of the multi-class imbalance problem.
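The three evaluation measures named in the abstract can be illustrated with a minimal sketch. It assumes the common definition of g-mean as the geometric mean of the per-class recalls; the abstract does not state the exact formula used in the paper. The function names and the class labels `web` and `p2p` are hypothetical and chosen only for illustration.

```python
from collections import Counter
from math import prod


def overall_accuracy(y_true, y_pred):
    """Fraction of samples whose predicted class equals the true class."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def per_class_recall(y_true, y_pred):
    """Recall of each class: correctly predicted samples / true samples of that class."""
    totals = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {c: hits[c] / totals[c] for c in totals}


def g_mean(recalls):
    """Geometric mean of the per-class recalls (assumed definition of g-mean)."""
    values = list(recalls.values())
    return prod(values) ** (1.0 / len(values))


# Hypothetical imbalanced example: 8 majority-class flows, 2 minority-class flows.
y_true = ["web"] * 8 + ["p2p"] * 2
y_pred = ["web"] * 8 + ["web", "p2p"]  # one minority flow is misclassified

recalls = per_class_recall(y_true, y_pred)
print(overall_accuracy(y_true, y_pred))  # accuracy stays high despite the miss
print(recalls)                           # minority recall drops to one half
print(g_mean(recalls))                   # g-mean penalizes the weak minority class
```

In this toy case the overall accuracy is 9 out of 10 while the minority-class recall is only 1 out of 2, which is why the abstract reports minority-class recall and g-mean alongside overall accuracy.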
Source
Journal of Chinese Computer Systems (《小型微型计算机系统》)
CSCD
Peking University Core Journal
2016, No. 11, pp. 2483-2487 (5 pages)
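Funding
Supported by the National Natural Science Foundation of China, Young Scientists Fund (61302093)
Supported by the Major Project of the Science and Technology Commission of Shanghai Municipality (14511101505)
Supported by the Academy-Municipality Cooperation Special Project of the Science and Technology Commission of Shanghai Municipality (13DZ1511200)
Supported by the Key Deployment Project of the Chinese Academy of Sciences (KGZW-EW-103)
Supported by the Open Research Fund of the National Mobile Communications Research Laboratory, Southeast University (2013D07)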
Keywords
Internet traffic classification
multi-class imbalance
statistical frequency
feature selection