Abstract
Industry has begun using cloud platforms to process massive high-dimensional data, simulating a variety of heterogeneous systems as a single system. Data mining in a Hadoop environment, however, encounters several problems, including the global nature of data models, random write operations on HDFS files, and short data lifetimes. To address these problems and enable efficient mining of massive data on Hadoop, an efficient data mining framework on Hadoop is proposed. The framework uses a database to simulate a linked-list structure, manages the mined knowledge, and provides distributed computation methods for tree structures and graph models. On this basis, a statistical algorithm, Yscore binning, is implemented, together with tree-building algorithms for decision trees and KD-trees. The Vega cloud is used to simulate a Hadoop cluster. Experimental data show that the framework and algorithms are practical and feasible, and that they may be extended to fields beyond data mining.
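The abstract mentions using a database to simulate a linked-list structure. The following is only a minimal single-machine sketch of that general idea, not the paper's implementation: the table name `node`, the sentinel `HEAD_ID`, the helper functions, and the use of SQLite are all assumptions made for illustration. Each row stores a payload and the id of its successor, so inserting in the middle of the list updates two rows and never rewrites a file in place.

```python
import sqlite3

HEAD_ID = 1  # hypothetical sentinel row that marks the start of the list


def create_store():
    """Create an in-memory table whose rows act as linked-list nodes."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE node (id INTEGER PRIMARY KEY, payload TEXT, next_id INTEGER)"
    )
    # The sentinel head has no payload and initially no successor.
    conn.execute(
        "INSERT INTO node (id, payload, next_id) VALUES (?, NULL, NULL)", (HEAD_ID,)
    )
    conn.commit()
    return conn


def insert_after(conn, prev_id, payload):
    """Insert a new node right after prev_id and return the new node's id."""
    row = conn.execute(
        "SELECT next_id FROM node WHERE id = ?", (prev_id,)
    ).fetchone()
    if row is None:
        raise KeyError(prev_id)
    # The new node takes over the predecessor's old successor.
    cur = conn.execute(
        "INSERT INTO node (payload, next_id) VALUES (?, ?)", (payload, row[0])
    )
    new_id = cur.lastrowid
    # The predecessor now points at the new node.
    conn.execute("UPDATE node SET next_id = ? WHERE id = ?", (new_id, prev_id))
    conn.commit()
    return new_id


def traverse(conn):
    """Follow next_id pointers from the sentinel and collect payloads in order."""
    items = []
    next_id = conn.execute(
        "SELECT next_id FROM node WHERE id = ?", (HEAD_ID,)
    ).fetchone()[0]
    while next_id is not None:
        payload, next_id = conn.execute(
            "SELECT payload, next_id FROM node WHERE id = ?", (next_id,)
        ).fetchone()
        items.append(payload)
    return items
```

In this sketch a caller would build a list with repeated `insert_after` calls and read it back with `traverse`. Keeping the pointers in indexed rows is one way to get cheap in-place updates, which is the kind of operation the abstract identifies as problematic for HDFS files.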
Source
Journal of System Simulation (《系统仿真学报》)
Indexed in: CAS, CSCD, Peking University Core Journals (北大核心)
2013, Issue 5, pp. 936-944 (9 pages)
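Funding
National Natural Science Foundation of China (61035003, 61072085, 61202212, 60933004)
National Basic Research Program of China, 973 Program (2013CB329502)
National High Technology Research and Development Program of China, 863 Program (2012AA011003)
National Key Technology R&D Program (2012BA107B02)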