Abstract
This paper proposes a method for classifying two-class imbalanced big data based on MapReduce and oversampling. The method has five steps: (1) for each positive instance, its nearest neighbor of the opposite class is found with MapReduce; (2) several new positive instances are generated on the line between the two points; (3) taking the size of the enlarged positive set as the reference, the negative instances are randomly partitioned into several subsets; (4) each negative subset is combined with the positive set to build a balanced data subset; (5) a classifier is trained with an extreme learning machine on each balanced subset, and the trained classifiers are combined by majority voting to classify new instances. The method is compared experimentally with three related methods on five two-class imbalanced big data sets, and the results show that it outperforms all three.
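The five steps in the abstract can be illustrated with a small single-machine sketch. It is not the authors' implementation: the MapReduce nearest-neighbor search is replaced by a brute-force NumPy distance computation, the interpolation range `t` in [0, 0.5) is an assumption (the abstract only says the new points lie on the line between the two instances), and the names `oversample`, `balanced_subsets`, `ELM`, `train_ensemble`, and `predict_majority` are hypothetical.

```python
import numpy as np


def oversample(pos, neg, n_new, rng):
    """Steps 1-2: find each positive's nearest negative, then interpolate."""
    # Brute-force squared distances, shape (n_pos, n_neg); the paper
    # distributes this search with MapReduce instead.
    d = ((pos[:, None, :] - neg[None, :, :]) ** 2).sum(axis=2)
    nearest = neg[d.argmin(axis=1)]
    # Assumption: new points are drawn on the half of the segment
    # closer to the positive instance.
    t = rng.uniform(0.0, 0.5, size=(len(pos), n_new, 1))
    synth = pos[:, None, :] + t * (nearest - pos)[:, None, :]
    return np.vstack([pos, synth.reshape(-1, pos.shape[1])])


def balanced_subsets(pos_aug, neg, rng):
    """Steps 3-4: split negatives into chunks about the size of pos_aug."""
    k = max(1, len(neg) // len(pos_aug))
    for idx in np.array_split(rng.permutation(len(neg)), k):
        X = np.vstack([pos_aug, neg[idx]])
        y = np.concatenate([np.ones(len(pos_aug), dtype=int),
                            np.zeros(len(idx), dtype=int)])
        yield X, y


class ELM:
    """Minimal extreme learning machine: random hidden layer, least-squares output."""

    def __init__(self, n_hidden, rng):
        self.n_hidden, self.rng = n_hidden, rng

    def _hidden(self, X):
        return 1.0 / (1.0 + np.exp(-(X @ self.W + self.b)))

    def fit(self, X, y):
        self.W = self.rng.normal(size=(X.shape[1], self.n_hidden))
        self.b = self.rng.normal(size=self.n_hidden)
        targets = np.eye(2)[y]  # one-hot labels
        self.beta = np.linalg.pinv(self._hidden(X)) @ targets
        return self

    def predict(self, X):
        return (self._hidden(X) @ self.beta).argmax(axis=1)


def train_ensemble(pos, neg, n_new=2, n_hidden=20, seed=0):
    """Step 5a: train one ELM per balanced subset."""
    rng = np.random.default_rng(seed)
    pos_aug = oversample(pos, neg, n_new, rng)
    return [ELM(n_hidden, rng).fit(X, y)
            for X, y in balanced_subsets(pos_aug, neg, rng)]


def predict_majority(models, X):
    """Step 5b: majority vote; ties go to the positive class (label 1)."""
    votes = np.stack([m.predict(X) for m in models])
    return (votes.mean(axis=0) >= 0.5).astype(int)


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    pos = rng.normal(6.0, 0.5, size=(10, 2))   # toy minority class
    neg = rng.normal(0.0, 0.5, size=(90, 2))   # toy majority class
    models = train_ensemble(pos, neg)
    print(len(models), predict_majority(models, pos))
```

With 10 positives, 2 synthetic points per positive, and 90 negatives, the enlarged positive set has 30 instances, so the negatives are split into 3 subsets and 3 classifiers are trained. The dense distance matrix does not scale to big data, which is the reason the paper distributes that step.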
Authors
翟俊海
张明阳
王陈希
刘晓萌
王耀达
Zhai Junhai1,2, Zhang Mingyang2, Wang Chenxi3, Liu Xiaomeng2, Wang Yaoda2 (1. Key Lab of Machine Learning and Computational Intelligence, Baoding, 071002, China; 2. College of Mathematics and Information Science, Hebei University, Baoding, 071002, China; 3. College of Computer Science and Technology, Hebei University, Baoding, 071002, China)
Source
《数据采集与处理》
CSCD
Peking University Core Journal
2018, No. 3, pp. 416-425 (10 pages)
Journal of Data Acquisition and Processing
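Funding
Supported by the National Natural Science Foundation of China (71371063)
Supported by the Natural Science Foundation of Hebei Province (F2017201026)
Supported by the Natural Science Research Program of Hebei University (799207217071)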
Keywords
big data
imbalanced classification
oversampling
nearest neighbor