Abstract
Noise and redundant information in near-infrared (NIR) spectra lower the recognition rate of classification models. To address this, a feature selection algorithm combining random forest with game theory is proposed. The algorithm first measures feature importance with a random forest and preselects the features that are relevant to classification. It then computes weights for the preselected features using an improved Shapley value combined with mutual information, and removes redundancy from the weighted feature set to obtain the optimal feature subset. To validate the algorithm, it was applied to a model for identifying the production area of tobacco leaves. The experimental results show that the proposed feature selection algorithm performs well for tobacco leaf origin identification, with a classification recognition rate of up to 95.88%.
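The weighting step described in the abstract can be illustrated with a minimal sketch. The abstract does not give the paper's "improved" Shapley formulation, so the code below uses the classical Shapley value and assumes, purely for illustration, that the worth of a feature coalition is the mutual information between the joint value of its features and the class label. It also assumes the random forest preselection has already been done and that the features are discretized. The function names and the toy data are hypothetical.

```python
from collections import Counter
from itertools import combinations
from math import factorial, log2


def mutual_information(xs, ys):
    """I(X;Y) in bits for two equal-length sequences of discrete values."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum((c / n) * log2(c * n / (px[x] * py[y]))
               for (x, y), c in pxy.items())


def coalition_value(columns, subset, labels):
    """Assumed characteristic function v(S): mutual information between
    the joint value of the features in S and the class labels."""
    if not subset:
        return 0.0
    joint = list(zip(*(columns[j] for j in subset)))
    return mutual_information(joint, labels)


def shapley_weights(columns, labels):
    """Classical (not the paper's improved) Shapley value of each feature,
    computed exactly by enumerating all coalitions; only practical for a
    small number of preselected features."""
    n = len(columns)
    weights = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        phi = 0.0
        for k in range(n):
            coeff = factorial(k) * factorial(n - k - 1) / factorial(n)
            for s in combinations(others, k):
                gain = (coalition_value(columns, s + (i,), labels)
                        - coalition_value(columns, s, labels))
                phi += coeff * gain
        weights.append(phi)
    return weights


# Toy data (hypothetical): feature 0 determines the label, feature 1 is
# an exact copy of feature 0 (redundant), feature 2 is unrelated noise.
labels = [0, 0, 0, 0, 1, 1, 1, 1]
columns = [
    [0, 0, 0, 0, 1, 1, 1, 1],
    [0, 0, 0, 0, 1, 1, 1, 1],
    [0, 1, 0, 1, 0, 1, 0, 1],
]
weights = shapley_weights(columns, labels)
```

On this toy data the two identical features split the credit equally and the noise feature receives zero weight, which is the kind of signal a redundancy-removal step could act on. How the paper actually prunes the weighted feature set is not stated in the abstract.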
Source
Journal of Instrumental Analysis (《分析测试学报》)
CAS
CSCD
Peking University Core Journal
2017, Issue 10, pp. 1203-1207 (5 pages)
Funding
National Key Technology R&D Program of China (2015BAF12B01)
China Tobacco Yunnan Industrial Co., Ltd. project (JSZX2014YL01, 20530001020152000086)
Keywords
NIR spectroscopy
random forest
feature selection
Shapley value
production area identification