Abstract
In short text classification, feature selection and feature weighting are two important steps. The traditional chi-square statistic (CHI) method selects features that are negatively correlated with a category, which degrades the performance of short text classification models. To address this problem, the authors propose a new hybrid feature selection algorithm. It combines an improved class keyword extraction method for short texts with an improved CHI feature selection method and extends the class keywords into the document vectors, which effectively overcomes the negative-correlation problem of CHI. In experiments on classifying short texts from online medical consultations, the new algorithm is compared with the traditional CHI method and with other feature selection algorithms, and the results show that it outperforms the traditional feature selection algorithms.
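As background for the negative-correlation problem named in the abstract, the following is a minimal sketch of the standard chi-square statistic computed from a 2x2 term/category contingency table. It is not the authors' improved method, which the abstract does not specify. The function names and the document counts are hypothetical and chosen only for illustration.

```python
def chi_square(a, b, c, d):
    """Standard chi-square statistic for a term/category 2x2 table.

    a: documents in the category that contain the term
    b: documents outside the category that contain the term
    c: documents in the category that lack the term
    d: documents outside the category that lack the term
    """
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    if denom == 0:
        return 0.0  # degenerate table: no evidence either way
    return n * (a * d - b * c) ** 2 / denom


def correlation_sign(a, b, c, d):
    """Return +1 for positive association, -1 for negative, 0 for none."""
    diff = a * d - b * c
    return (diff > 0) - (diff < 0)


# Hypothetical counts, for illustration only.
positive = (40, 10, 10, 40)  # term concentrated inside the category
negative = (10, 40, 40, 10)  # term concentrated outside the category

score_pos = chi_square(*positive)
score_neg = chi_square(*negative)

print(score_pos, correlation_sign(*positive))
print(score_neg, correlation_sign(*negative))
```

Because the difference `a * d - b * c` is squared, both tables receive the same score even though the term in the second table mostly appears outside the category. Checking the sign of that difference is one common way to tell the two cases apart; whether the paper uses this particular check is an assumption.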
Authors
Zhang Qiangqiang (张强强)
Su Bianping (苏变萍)
Li Min (李敏)
College of Science, Xi'an University of Architecture and Technology, Xi'an, Shaanxi 710055, China
Source
Information & Computer (《信息与电脑》)
2018, No. 16, pp. 34-36 (3 pages)
Funding
Shaanxi Provincial Social Science Fund Project (Project No. 13D175)