Abstract
In modern information technology, finding the information users actually need quickly, accurately, and comprehensively has become a research focus. Building on text categorization theory, this paper addresses the shortcomings of the KNN algorithm by proposing a text categorization algorithm based on density clustering, which determines the category of a text to be classified by computing its similarity and the magnitude of its weight value. The algorithm is validated in three experiments. The results show that the algorithm based on density clustering outperforms the KNN classification algorithm across different feature selection methods and different numbers of feature words, and that, among several similarity measures, Jensen-Shannon divergence is better suited to the density clustering algorithm.
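For readers unfamiliar with the Jensen-Shannon divergence named in the abstract, the following is a minimal sketch of its standard definition for two discrete distributions. The abstract does not say how the paper represents documents or applies the measure, so the term-count vectors and the helper names (`normalize`, `kl_divergence`, `js_divergence`) are illustrative assumptions, not the authors' implementation.

```python
import math


def normalize(counts):
    """Turn raw term counts into a probability distribution (assumes sum > 0)."""
    total = sum(counts)
    return [c / total for c in counts]


def kl_divergence(p, q):
    """Kullback-Leibler divergence KL(p || q) in bits.

    Terms with p_i == 0 contribute nothing; the caller must ensure
    q_i > 0 wherever p_i > 0.
    """
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence between two distributions of equal length.

    The mixture m is positive wherever p or q is positive, so both
    KL terms are always well defined.
    """
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


# Hypothetical term counts for two documents over a shared 4-word vocabulary.
doc_a = normalize([3, 1, 0, 0])
doc_b = normalize([0, 0, 2, 2])
doc_c = normalize([2, 2, 0, 0])

print(js_divergence(doc_a, doc_b))  # disjoint vocabularies: maximal divergence
print(js_divergence(doc_a, doc_c))  # overlapping vocabularies: smaller divergence
```

With base-2 logarithms the value lies between 0 (identical distributions) and 1 (no shared terms), and it is symmetric in its two arguments, which makes it convenient as a dissimilarity score between documents.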
Source
《图书馆学研究》 (Research on Library Science)
CSSCI
2016, No. 13, pp. 74-83 (10 pages)
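Funding
One of the research outcomes of the National Social Science Fund of China project "Semantic Mining Research on Tagging Systems of Digital Libraries" (grant no. 12CTQ003).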