Abstract
Data with missing attributes cannot be used effectively in machine learning, which lowers classification and regression accuracy. To address this problem, an approximate imputation method called k-ANNO is proposed. Given a sample with missing attributes, the method first uses a graph structure built offline to approximately search for the k nearest neighbors of the sample. It then applies a fast quadratic programming algorithm to estimate the optimal weight of each neighbor. Finally, the missing attributes are imputed from the corresponding attributes of the neighbors according to the estimated weights. Users can trade off imputation efficiency against accuracy according to their needs. The k-ANNO method addresses the widespread missing-data problem in machine learning and effectively suppresses the impact of missing data on classification and regression accuracy. Experiments on several public datasets show that, when the speedup parameter is between 2 and 10, the classification error rate of k-ANNO is 1% to 4% lower than that of existing mean imputation, C-means imputation, and self-organizing map imputation, and its regression root-mean-square error is about 0.5 to 2.0 lower. With a sample size of 4 000 and across different speedup parameters, k-ANNO is about 35% to 320% more computationally efficient than naive k-nearest-neighbor imputation.
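The three steps named in the abstract (neighbor search, weight estimation, weighted imputation) can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not give the graph construction or the exact quadratic program, so the sketch assumes an exact brute-force search in place of the graph-based approximate search, and assumes the quadratic program is a least-squares reconstruction of the observed attributes with weights constrained to sum to one, which has a closed-form solution. The function name `impute_sample` and the parameters `k` and `reg` are hypothetical.

```python
import numpy as np


def impute_sample(x, data, k=3, reg=1e-3):
    """Fill the NaN entries of x from its k nearest complete rows in data."""
    x = np.asarray(x, dtype=float)
    data = np.asarray(data, dtype=float)
    obs = ~np.isnan(x)  # mask of observed attributes in x

    # Step 1 (assumption): exact brute-force search on the observed
    # attributes, standing in for the offline graph-based approximate search.
    dist = np.linalg.norm(data[:, obs] - x[obs], axis=1)
    idx = np.argsort(dist)[:k]
    nbrs = data[idx]

    # Step 2 (assumed QP): minimise ||x_obs - sum_i w_i * nbr_i_obs||^2
    # subject to sum(w) = 1. With that constraint the objective equals
    # w^T G w, where G is the Gram matrix of the differences, so the
    # minimiser is proportional to G^{-1} 1.
    diff = nbrs[:, obs] - x[obs]
    gram = diff @ diff.T
    gram += reg * (np.trace(gram) + 1e-12) * np.eye(len(idx))  # keep G invertible
    w = np.linalg.solve(gram, np.ones(len(idx)))
    w /= w.sum()

    # Step 3: impute each missing attribute as the weighted sum of the
    # neighbours' values for that attribute.
    filled = x.copy()
    filled[~obs] = w @ nbrs[:, ~obs]
    return filled, w
```

In this sketch, `k` plays the role of the neighbor count; the speedup parameter mentioned in the abstract belongs to the approximate search and has no counterpart here.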
Authors
CAO Weiquan (曹卫权)
CHU Yanjie (褚衍杰)
LI Xian (李显)
National Key Laboratory of Science and Technology on Blind Signal Processing, 610041, China
Source
《西安交通大学学报》
EI
CAS
CSCD
Peking University Core Journals (北大核心)
2017, No. 10, pp. 142-148 (7 pages)
Journal of Xi'an Jiaotong University
Funding
Supported by the National Natural Science Foundation of China (U1536105)
Keywords
machine learning (机器学习)
missing attributes (残缺项)
quadratic programming (二次规划)
imputation method (补全方法)