Abstract
Traditional sample selection methods suffer from high computational complexity and long running times when they are used to compress large data sets. To address this problem, a sample selection method based on unstable cut points is proposed. Since a convex function attains its extreme values at the endpoints of an interval, the method marks the unstable cut points of all attributes to measure the degree to which a sample is an endpoint, and then selects the samples with a higher endpoint degree, thereby avoiding the computation of distances between samples. The aim is to compress the data set and improve computational efficiency without affecting classification accuracy. Experimental results show that the proposed method compresses data sets with a high class-imbalance ratio effectively and exhibits strong robustness to noise.
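The abstract describes the selection mechanism only at a high level. The sketch below is a minimal, hypothetical illustration of how an endpoint degree of that kind could be computed, assuming that an unstable cut point lies between two samples that are adjacent in the sorted order of a single attribute but carry different class labels; the function names, the NumPy-based implementation, and the selection ratio are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def endpoint_degree(X, y):
    """Count, for each sample, how many unstable cuts it borders.

    Assumption (not from the paper): an 'unstable cut' sits between two
    samples that are adjacent when sorted by one attribute but belong to
    different classes; both neighbours of such a cut gain one degree.
    Ties between equal attribute values are not treated specially here.
    """
    n_samples, n_attrs = X.shape
    degree = np.zeros(n_samples, dtype=int)
    for a in range(n_attrs):
        order = np.argsort(X[:, a])                 # sort samples by attribute a
        sorted_labels = y[order]
        # positions where the class label changes between neighbours
        changes = np.nonzero(sorted_labels[:-1] != sorted_labels[1:])[0]
        degree[order[changes]] += 1                 # left neighbour of the cut
        degree[order[changes + 1]] += 1             # right neighbour of the cut
    return degree

def select_samples(X, y, ratio=0.3):
    """Keep the fraction `ratio` of samples with the highest endpoint degree."""
    degree = endpoint_degree(X, y)
    k = max(1, int(ratio * len(y)))
    keep = np.argsort(-degree)[:k]
    return X[keep], y[keep]
```

Note that this sketch never computes pairwise distances: each attribute is handled by a single sort and a label-change scan, which is the property the abstract attributes to the cut-point-based approach.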
Authors
WANG Xizhao (王熙照), XING Sheng (邢胜), ZHAO Shixin (赵士欣)
Affiliations: College of Mathematics and Information Science, Hebei University, Baoding 071002; School of Management, Hebei University, Baoding 071002; College of Computer Science and Engineering, Cangzhou Normal University, Cangzhou 061001; Department of Mathematics and Physics, Shijiazhuang Tiedao University, Shijiazhuang 050045
Source
《模式识别与人工智能》(Pattern Recognition and Artificial Intelligence)
Indexed in EI, CSCD, and the Peking University Core Journal list (北大核心)
2016, No. 9, pp. 780-789 (10 pages)
Funding
Supported by the National Natural Science Foundation of China (No. 713710630) and the Science and Technology Plan Project of Shenzhen (No. JCYJ20150324140036825)
Keywords
Large data classification
Sample selection
Unstable cut points
Decision tree