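Abstract
Classification models for data streams are a fundamental problem in the real-time analysis of continuously changing data. Most current data stream algorithms handle only abrupt or gradual concept drift and do not fully account for the fact that concepts can recur. To address this, an adaptive ensemble algorithm with a concept drift detection mechanism is proposed. From the perspective of information entropy, the Jensen-Shannon divergence is used to measure the distance between the data distributions of two adjacent windows, which not only detects different types of concept drift but also effectively discovers recurring concepts; a classifier pool mechanism stores historical concepts so that they can be reused. The proposed algorithm was compared extensively with several classic learning algorithms on synthetic and real datasets. The experimental results show that the proposed algorithm has a clear advantage in average classification accuracy, consumes less time than other ensemble algorithms, suits environments with multiple types of concept drift, and is highly robust to noise.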
Processing streaming data imposes new requirements: limited memory, short processing time, and a single scan of incoming instances. Most approaches in the literature that deal with concept drift focus only on gradual or abrupt drift and do not address recurring concepts. Motivated by this challenge, an ensemble with internal change detection is proposed to enhance performance by exploiting recurring concepts. It maintains a pool of classifiers, to which classifiers are dynamically added and from which they are removed in response to the change detector. The algorithm adopts a two-window change detection model that uses the Jensen-Shannon divergence to measure the distance between the distributions of two consecutive windows. When a change is detected, the repository of stored historical concepts is checked for reuse. The proposed algorithm has been experimentally compared with state-of-the-art algorithms on synthetic and real datasets. The results show the suitability of the proposed algorithm for different types of drift as well as static environments.
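The two-window comparison described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the use of empirical frequencies over discrete values, and the fixed `threshold` of 0.1 are all assumptions made for illustration, and the abstract does not state how the detection threshold is actually chosen.

```python
import math
from collections import Counter


def empirical_distribution(window):
    """Relative frequency of each discrete value in a non-empty window."""
    if not window:
        raise ValueError("window must not be empty")
    counts = Counter(window)
    total = len(window)
    return {value: count / total for value, count in counts.items()}


def js_divergence(p, q):
    """Jensen-Shannon divergence (log base 2) between two discrete
    distributions given as {value: probability} dicts.
    The result lies in [0, 1]; 0 means the distributions are identical."""
    keys = set(p) | set(q)
    # Mixture distribution M = (P + Q) / 2
    m = {k: 0.5 * (p.get(k, 0.0) + q.get(k, 0.0)) for k in keys}

    def kl(a):
        # Kullback-Leibler divergence KL(a || m); zero-probability terms contribute 0
        return sum(a[k] * math.log2(a[k] / m[k]) for k in a if a[k] > 0)

    return 0.5 * kl(p) + 0.5 * kl(q)


def drift_detected(reference_window, current_window, threshold=0.1):
    """Flag a change when the divergence between two consecutive windows
    exceeds `threshold` (a hypothetical value, not taken from the paper)."""
    p = empirical_distribution(reference_window)
    q = empirical_distribution(current_window)
    return js_divergence(p, q) > threshold
```

With these assumptions, two windows holding the same values in a different order give a divergence of 0 and no alarm, while two windows with no values in common give the maximum divergence of 1. Under the same assumptions, the same distance could be used to compare a new window against stored concepts when checking the pool for reuse.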
Authors
孙艳歌
王志海
原继东
白洋
SUN Yange, WANG Zhihai, YUAN Jidong, BAI Yang (School of Computer and Information Technology, Beijing Jiaotong University, Beijing 100044, China; School of Computer and Information Technology, Xinyang Normal University, Xinyang 464000, China)
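Funding
Supported by:
National Natural Science Foundation of China (61672086)
Henan Province Science and Technology Program (172102210454)
Beijing Jiaotong University Talent Fund (2016RC048)
Xinyang Normal University Young Backbone Teacher Program (2016GGJS-08)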
Keywords
data streams
concept drift
ensemble classifier
information entropy
recurring concepts