
中文文本分类研究 (Cited by: 6)

Study of Chinese Text Categorization
Abstract: This paper studies Chinese text categorization with k-nearest neighbor (kNN), support vector machines (SVM) and the maximum entropy model. For each of these three widely used classifiers, experiments were carried out with two feature representations: Boolean feature values and term-frequency feature values. Under identical conditions the maximum entropy model performed best, followed by SVM, with kNN slightly worse. When term-frequency information was introduced, classifier performance changed slightly: accuracy dropped by 1%-2% for maximum entropy, rose somewhat for kNN, and was essentially unchanged for SVM. Setting aside corpus-specific effects, this suggests that different amounts of term information affect different machine-learning algorithms in different ways.
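To make the comparison described in the abstract concrete, the sketch below shows one way such an experiment could be set up. It is an illustration under stated assumptions, not the authors' implementation: it assumes scikit-learn, uses LogisticRegression as a common stand-in for the maximum entropy classifier, and uses a hypothetical four-document corpus of pre-segmented Chinese text; CountVectorizer's binary flag switches between the Boolean and term-frequency representations compared in the paper.

```python
# Illustrative sketch only (not the paper's original experimental setup):
# comparing kNN, SVM and a maximum-entropy-style classifier on Boolean vs.
# term-frequency features. Assumes scikit-learn; LogisticRegression stands in
# for the maximum entropy model, and the tiny corpus below is hypothetical.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import LinearSVC
from sklearn.linear_model import LogisticRegression

# Hypothetical pre-segmented Chinese documents (tokens separated by spaces).
docs = ["经济 市场 股票 上涨", "体育 比赛 足球 冠军",
        "经济 银行 利率 下调", "体育 运动员 金牌 纪录"]
labels = ["finance", "sports", "finance", "sports"]

classifiers = {
    "kNN": KNeighborsClassifier(n_neighbors=3),
    "SVM": LinearSVC(),
    "MaxEnt": LogisticRegression(max_iter=1000),
}

# binary=True yields Boolean feature values; binary=False yields raw term frequencies.
for binary in (True, False):
    X = CountVectorizer(binary=binary).fit_transform(docs)
    for name, clf in classifiers.items():
        clf.fit(X, labels)
        acc = clf.score(X, labels)  # training accuracy only, for illustration
        feats = "Boolean" if binary else "term-frequency"
        print(f"{feats} features, {name}: {acc:.2f}")
```

In a real replication, the classifiers would be evaluated on held-out documents from a labelled Chinese corpus after feature selection, rather than scored on the training data as in this toy sketch.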
Source: Journal of Taiyuan University of Technology (《太原理工大学学报》), 2006, No. 6, pp. 710-713 (4 pages). Indexed by CAS; Peking University Core Journal (北大核心).
Keywords: text categorization; k-nearest neighbor; support vector machines; maximum entropy

References (9)

  1. Yang Y, Liu X. A re-examination of text categorization methods [C]. Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. New York: ACM Press, 1999.
  2. Joachims T. Text categorization with support vector machines: learning with many relevant features [C]. Proceedings of the European Conference on Machine Learning (ECML). Berlin, 1998: 137-142.
  3. Lewis D D. Naive (Bayes) at forty: the independence assumption in information retrieval [C]. Proceedings of the 10th European Conference on Machine Learning. New York, 1998: 4-15.
  4. Ratnaparkhi A. Maximum entropy models for natural language ambiguity resolution [D]. USA: University of Pennsylvania, 1998.
  5. Gu Bo, Liu Kaiying. A comparative study of decision tree and maximum entropy models for text categorization [C]. Proceedings of the 8th Joint National Conference on Computational Linguistics (全国第八届计算语言学联合学术会议). Nanjing, 2005: 382-387.
  6. Berger A L, Della Pietra S A, Della Pietra V J. A maximum entropy approach to natural language processing [J]. Computational Linguistics, 1996, 22(1): 38-73.
  7. Yuan Chunfa, Li Qingzhong, Wang Yun, et al. Foundations of Statistical Natural Language Processing (统计自然语言处理基础) [M]. Beijing: Publishing House of Electronics Industry: 338-374.
  8. Vapnik V. The Nature of Statistical Learning Theory [M]. New York: Springer, 1995.
  9. Darroch J N, Ratcliff D. Generalized iterative scaling for log-linear models [J]. The Annals of Mathematical Statistics, 1972, 43: 1470-1480.

Co-cited references: 55
Citing documents: 6
Secondary citing documents: 21
