Abstract
The decision tree algorithm builds a tree for data analysis from the known probabilities with which samples having different features occur, and it is a classic method among data classification algorithms. First, every data feature is treated as a candidate tree node and all features are traversed; each time a feature is visited, it is split and the information at the split point is recorded as the basis for measuring the purity of the resulting child nodes. Second, the recorded results are compared to identify the optimal feature and the optimal split, and the sample dataset is partitioned accordingly. Finally, a decision tree that conforms to the rules is constructed. To address the long time the traditional C4.5 decision tree algorithm takes to compute the information gain ratio, an improved K-C4.5 algorithm is proposed. Drawing on the ideas of the Maclaurin and Taylor formulas, it converts the information gain ratio formula from a logarithmic function into a non-logarithmic one, thereby reducing computation time. Tests on real datasets verify that the improved algorithm is effective to a certain degree.
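The idea of replacing the logarithm in the gain ratio can be illustrated with a minimal Python sketch. It is not the paper's K-C4.5 formula, which the abstract does not give. The sketch assumes the first-order Maclaurin expansion ln(1 + x) ~= x with x = p - 1, so that -p ln p ~= p(1 - p); the function names `entropy`, `entropy_approx`, and `gain_ratio` are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy in bits, computed with the logarithm (standard C4.5)."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def entropy_approx(labels):
    """Log-free stand-in for entropy (assumption, not the paper's formula).

    Uses ln(p) ~= p - 1, the first-order Maclaurin expansion of ln(1 + x)
    with x = p - 1, so each term -p * ln(p) becomes p * (1 - p).
    Dividing by ln(2) keeps the result on the same scale as bits.
    """
    n = len(labels)
    return sum((c / n) * (1 - c / n) for c in Counter(labels).values()) / math.log(2)


def gain_ratio(values, labels, h=entropy):
    """C4.5 gain ratio of one categorical feature; h is the entropy function."""
    n = len(labels)
    groups = {}
    for v, y in zip(values, labels):
        groups.setdefault(v, []).append(y)
    # Weighted entropy of the class labels after splitting on the feature.
    conditional = sum(len(g) / n * h(g) for g in groups.values())
    # Split information: entropy of the feature's own value distribution.
    split_info = h(values)
    if split_info == 0:
        return 0.0  # a single-valued feature cannot split the data
    return (h(labels) - conditional) / split_info
```

Passing `h=entropy_approx` to `gain_ratio` swaps in the log-free variant. On a feature that separates the classes perfectly both variants return 1, and on a feature independent of the class both return 0, but in general the two can rank features differently, so any speed-up should be checked against accuracy on real data.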
Authors
LI Chun-sheng (李春生)
JIAO Hai-tao (焦海涛)
LIU Peng (刘澎)
LIU Xiao-gang (刘小刚)
School of Computer and Information Technology, Northeast Petroleum University, Daqing 163318, China
Source
《计算机技术与发展》
2020, No. 5, pp. 185-189 (5 pages)
Computer Technology and Development
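Funding
National Natural Science Foundation of China, General Program (51774090)
Natural Science Foundation of Heilongjiang Province, General Program (F2015020)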
Keywords
decision tree
data probability
information gain ratio
time efficiency
improved algorithm