Abstract
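This paper studies Chinese text classification with k-nearest neighbor (kNN), support vector machine (SVM), and maximum entropy models. For these three widely used models, we ran Chinese text classification experiments with two feature representations: Boolean values of feature terms and frequencies of feature terms. The results show that, under identical conditions, maximum entropy classifies best, SVM comes second, and kNN is slightly worse. We also found that introducing term-frequency information changes classifier performance slightly: accuracy drops by 1% to 2% for maximum entropy, rises somewhat for kNN, and stays about the same for SVM. Setting aside effects specific to the documents used, this indicates that differing amounts of term information affect different machine learning algorithms differently.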
In this paper, we compare three models for text categorization: k-nearest neighbor, support vector machines, and maximum entropy. We run experiments with the three models on two training data sets that have been processed by term selection and by removal of irrelevant data, respectively. The results show that maximum entropy outperforms the other two classifiers both with Boolean term values and with word frequencies added. We also find that adding word-frequency information changes the classifiers' performance somewhat. Setting aside the influence of the particular documents used, this result suggests that different kinds of term sets can affect different classifiers differently.
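The two feature representations compared in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the vocabulary, the toy documents, the function names, the use of cosine similarity, and the choice of k = 3 are all assumptions made for illustration, and real Chinese text would first need word segmentation. The sketch builds Boolean and term-frequency vectors over the same vocabulary and feeds either one to a small k-nearest-neighbor classifier.

```python
import math
from collections import Counter

# Hypothetical vocabulary and pre-segmented toy documents (illustration only).
VOCABULARY = ["ball", "team", "stock", "market"]
TRAINING_DOCS = [
    (["ball", "team", "ball"], "sports"),
    (["team", "ball"], "sports"),
    (["stock", "market", "stock"], "finance"),
    (["market", "stock"], "finance"),
]


def boolean_features(tokens, vocabulary):
    """1.0 if the term occurs in the document at all, else 0.0."""
    present = set(tokens)
    return [1.0 if term in present else 0.0 for term in vocabulary]


def frequency_features(tokens, vocabulary):
    """Raw count of each vocabulary term in the document."""
    counts = Counter(tokens)
    return [float(counts[term]) for term in vocabulary]


def cosine(u, v):
    """Cosine similarity; returns 0.0 when either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def knn_predict(query_vec, training_vecs, k=3):
    """Majority label among the k training vectors most similar to the query."""
    ranked = sorted(training_vecs, key=lambda item: cosine(query_vec, item[0]), reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


def classify(tokens, featurize, k=3):
    """Vectorize the training set and the query with the same featurizer, then vote."""
    training_vecs = [(featurize(doc, VOCABULARY), label) for doc, label in TRAINING_DOCS]
    return knn_predict(featurize(tokens, VOCABULARY), training_vecs, k)


QUERY = ["stock", "stock", "market", "team"]
boolean_label = classify(QUERY, boolean_features)
frequency_label = classify(QUERY, frequency_features)
```

Swapping `boolean_features` for `frequency_features` is the only change between the two conditions, which mirrors the abstract's setup of holding everything else fixed while varying the term information given to the classifier.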
Source
Journal of Taiyuan University of Technology (《太原理工大学学报》)
CAS
Peking University Core Journals (北大核心)
2006, No. 6, pp. 710-713 (4 pages)
Keywords
text categorization
k-nearest neighbor
support vector machines
maximum entropy