
A Fast KNN for Text Categorization    Cited by: 5
Abstract  KNN (k-Nearest Neighbor) is a simple, effective, non-parametric method for text categorization. The traditional KNN method suffers from the heavy cost of computing sample similarities, which makes it impractical for text categorization with large numbers of high-dimensional samples. This paper proposes TKNN (Tree-k-Nearest-Neighbor), an algorithm that finds the exact k nearest neighbors quickly by building a search tree that accelerates the neighbor lookup. First, all samples are sorted by their distance to the center of the whole sample set and divided into L equal groups, which become the children of the root; each child is then processed in the same way until the number of samples in every group lies within [k, 2k]. Searching for the k nearest neighbors along this tree narrows the search scope and greatly reduces the amount of similarity computation.
Source  Journal of Hebei University (Natural Science Edition) (《河北大学学报(自然科学版)》), CAS, Peking University Core Journal, 2008, Issue 3, pp. 322-326 (5 pages)
Keywords  KNN, text categorization, similarity
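The abstract above outlines how the TKNN search tree is built and queried. The sketch below is a minimal, hypothetical Python rendering of that description: the samples in each node are sorted by distance to the node's center, split into L roughly equal groups, and the recursion stops once a group holds at most 2k samples. The descent rule used during search (follow the child whose center is closest to the query, then brute-force within that leaf) is an assumption, since the abstract does not spell out the exact pruning strategy, and the names TKNNNode, build_tree, and search_knn are illustrative only.

```python
# Minimal sketch of the TKNN idea from the abstract (not the authors' code).
# Assumptions: Euclidean distance, closest-center descent during search.
import numpy as np


class TKNNNode:
    def __init__(self, samples, labels):
        self.samples = samples                    # (n, d) feature vectors
        self.labels = labels                      # length-n class labels
        self.center = samples.mean(axis=0)        # center of this group
        self.children = []                        # empty for leaf nodes


def build_tree(samples, labels, k, L):
    """Recursively build the search tree described in the abstract."""
    node = TKNNNode(samples, labels)
    if len(samples) <= 2 * k:                     # stop once group size <= 2k
        return node
    # Sort the group's samples by distance to its center.
    dists = np.linalg.norm(samples - node.center, axis=1)
    order = np.argsort(dists)
    # Split the sorted sequence into L roughly equal groups; with a naive
    # split a leaf may end up slightly smaller than k near the boundary.
    for idx in np.array_split(order, L):
        node.children.append(build_tree(samples[idx], labels[idx], k, L))
    return node


def search_knn(root, query, k):
    """Descend to the child whose center is closest to the query,
    then brute-force the k nearest neighbors inside that leaf group."""
    node = root
    while node.children:
        node = min(node.children,
                   key=lambda c: np.linalg.norm(query - c.center))
    dists = np.linalg.norm(node.samples - query, axis=1)
    nearest = np.argsort(dists)[:k]
    return node.samples[nearest], node.labels[nearest]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(1000, 20))               # toy "document" vectors
    y = rng.integers(0, 3, size=1000)             # toy class labels
    tree = build_tree(X, y, k=5, L=4)
    _, neigh_y = search_knn(tree, rng.normal(size=20), k=5)
    print(neigh_y)                                # majority vote gives the class
```

For actual text categorization the similarity would normally be cosine similarity over TF-IDF vectors rather than the Euclidean norm used here; the random data above only serves to show the tree mechanics and the reduced search scope at query time.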