Abstract
To address a shortcoming of C4.5 decision tree construction, namely that test-attribute selection does not consider the influence of attributes on one another, an improved C4.5 decision tree algorithm is proposed. The algorithm represents the redundancy between an attribute and the other attributes by their average information entropy (average information gain). This redundancy is then included in the selection of the test attribute: using three evaluation factors (information gain, split entropy, and redundancy), the algorithm selects the test attribute with a high information gain ratio and low redundancy with the other attributes. The experimental results show that, on the selected experimental data sets, the improved C4.5 decision tree algorithm increases average classification accuracy.
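The selection step described above can be illustrated with a minimal Python sketch. The abstract does not give the formula that combines the three factors, so two things here are assumptions rather than the authors' method: redundancy is computed as the average mutual information between an attribute and each other attribute, and the hypothetical score `gain_ratio / (1 + redundancy)` is used only to show how high gain ratio and low redundancy could be traded off. All function names are illustrative.

```python
from collections import Counter
from math import log2


def entropy(values):
    """Shannon entropy (in bits) of a sequence of discrete values."""
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in Counter(values).values())


def conditional_entropy(target, given):
    """H(target | given) for two equal-length sequences of discrete values."""
    n = len(target)
    groups = {}
    for g, t in zip(given, target):
        groups.setdefault(g, []).append(t)
    return sum(len(ts) / n * entropy(ts) for ts in groups.values())


def info_gain(target, attr):
    """Information gain of attr about target: H(target) - H(target | attr)."""
    return entropy(target) - conditional_entropy(target, attr)


def gain_ratio(labels, attr):
    """C4.5 gain ratio: information gain divided by split entropy."""
    split = entropy(attr)  # split entropy of the attribute's own values
    return info_gain(labels, attr) / split if split > 0 else 0.0


def redundancy(columns, name):
    """ASSUMPTION: average mutual information between `name` and every other attribute."""
    others = [c for c in columns if c != name]
    if not others:
        return 0.0
    return sum(info_gain(columns[name], columns[o]) for o in others) / len(others)


def choose_attribute(columns, labels):
    """ASSUMPTION: hypothetical score that rewards gain ratio and penalizes redundancy."""
    def score(name):
        return gain_ratio(labels, columns[name]) / (1.0 + redundancy(columns, name))
    return max(columns, key=score)


# Toy data: "a_copy" duplicates "a", so both carry redundancy; "b" is independent.
columns = {"a": [0, 0, 1, 1], "a_copy": [0, 0, 1, 1], "b": [0, 1, 0, 1]}
labels = [0, 0, 1, 1]
best = choose_attribute(columns, labels)
```

In this toy data the duplicated attributes receive a nonzero redundancy while the independent attribute `b` receives none, which is the kind of signal the improved algorithm is said to exploit. How the published method weights redundancy against gain ratio may differ from this sketch.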
Source
Journal of North University of China (Natural Science Edition) (《中北大学学报(自然科学版)》)
CAS
PKU Core Journals (北大核心)
2014, No. 4, pp. 402-406 (5 pages)
Keywords
C4.5 decision tree
attribute correlation
information entropy
information gain ratio
redundancy