Abstract
In Spark MLlib, decision tree algorithms are divided into classification trees and regression trees according to whether the target feature values are continuous. Classification decision trees are further divided into the ID3 and CART algorithms according to their feature selection criteria. In the experiments, information entropy and the Gini coefficient are used as splitting criteria to partition the training dataset, and the performance of the two criteria is compared on datasets of different sizes. The results show that, while training efficiency is maintained, the classification accuracy of the tree model trained with information entropy is higher than that of the model trained with the Gini coefficient as the dataset size increases.
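The comparison described above can be reproduced with a short Spark MLlib program. The sketch below is illustrative only: it uses the DataFrame-based spark.ml API, and the input path, 80/20 split, and tree depth are assumptions rather than the paper's actual experimental setup. It trains one decision tree with information entropy and one with the Gini criterion on the same training split and prints each model's test accuracy.

import org.apache.spark.ml.classification.DecisionTreeClassifier
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
import org.apache.spark.sql.SparkSession

object ImpurityComparison {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("ImpurityComparison").getOrCreate()

    // Hypothetical input: a LIBSVM-format file providing "label" and "features" columns.
    val data = spark.read.format("libsvm").load("data/sample_libsvm_data.txt")
    val Array(train, test) = data.randomSplit(Array(0.8, 0.2), 42L)

    // Evaluator used for both models: multiclass classification accuracy on the test split.
    val evaluator = new MulticlassClassificationEvaluator()
      .setLabelCol("label")
      .setPredictionCol("prediction")
      .setMetricName("accuracy")

    // Train one tree per splitting criterion and report its test accuracy.
    Seq("entropy", "gini").foreach { impurity =>
      val model = new DecisionTreeClassifier()
        .setLabelCol("label")
        .setFeaturesCol("features")
        .setImpurity(impurity)   // "entropy" = information entropy, "gini" = Gini criterion
        .setMaxDepth(5)          // illustrative depth, not the paper's setting
        .fit(train)
      val accuracy = evaluator.evaluate(model.transform(test))
      println(s"impurity = $impurity, test accuracy = $accuracy")
    }

    spark.stop()
  }
}

In both the RDD-based spark.mllib API and the DataFrame-based spark.ml API shown here, the impurity measure is selected by the same "entropy"/"gini" string, so switching between the two splitting criteria only changes that one parameter.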
Authors
杜小芳
陈毅红
DU Xiaofang; CHEN Yihong (College of Computer Science, China West Normal University, Nanchong 637002, China; Internet of Things Perception and Big Data Analysis Key Laboratory of Nanchong, Nanchong 637002, China)
Source
《太原师范学院学报(自然科学版)》
2020, No. 4, pp. 37-39, 51 (4 pages)
Journal of Taiyuan Normal University: Natural Science Edition
Funding
General Program of the National Natural Science Foundation of China (61871330)
Talent Fund of China West Normal University (17YC148)
Doctoral Start-up Fund of China West Normal University (16E008).