Abstract
KNN (k-Nearest Neighbor) is a simple, effective, non-parametric method for text categorization. Traditional KNN has an obvious drawback: it requires a large number of similarity computations, which makes it impractical for text categorization with many high-dimensional samples. This paper proposes TKNN (Tree-k-Nearest-Neighbor), an algorithm that quickly finds the exact k nearest neighbors by building a search tree to accelerate the lookup. First, taking the center of the whole sample set as the reference, all samples are sorted by their distance to the center and divided equally into L groups, which become the children of the root. Each child is processed in the same way until every group contains between k and 2k samples. Searching for the k nearest neighbors along this tree narrows the search range and greatly reduces the number of similarity computations.
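The tree construction described in the abstract can be sketched roughly as follows. This is an illustrative reading, not the paper's implementation, and several details are assumptions because the abstract does not give them: Euclidean distance stands in for the paper's similarity measure, the "center" is taken to be the mean vector, the number of groups is capped at `n // k` so that no group falls below k samples, and the search step uses a triangle-inequality bound on each group's distance range to skip groups that cannot contain a closer neighbor. The names `build_tree`, `search`, and `tknn_query` are hypothetical.

```python
import heapq
import numpy as np


def build_tree(X, idx, k, L):
    """Recursively split the samples in `idx` into up to L distance-ordered groups."""
    assert k >= 1 and L >= 2
    node = {"idx": idx, "children": []}
    if len(idx) <= 2 * k:          # group is small enough: make it a leaf
        return node
    center = X[idx].mean(axis=0)   # assumption: center = mean vector
    dist = np.linalg.norm(X[idx] - center, axis=1)
    order = np.argsort(dist)       # sort samples by distance to the center
    groups = min(L, len(idx) // k) # assumption: cap so every group keeps >= k samples
    node["center"] = center
    for part in np.array_split(order, groups):
        child = build_tree(X, idx[part], k, L)
        # distance range of this group, measured from the PARENT's center
        child["lo"], child["hi"] = dist[part[0]], dist[part[-1]]
        node["children"].append(child)
    return node


def search(node, X, q, k, heap):
    """Branch-and-bound search; `heap` keeps the k best as (-distance, index)."""
    if not node["children"]:       # leaf: compare the query with every sample
        for i in node["idx"]:
            d = float(np.linalg.norm(X[i] - q))
            if len(heap) < k:
                heapq.heappush(heap, (-d, int(i)))
            elif d < -heap[0][0]:
                heapq.heapreplace(heap, (-d, int(i)))
        return
    d_c = float(np.linalg.norm(q - node["center"]))
    # Triangle inequality: a sample whose distance to the center lies in
    # [lo, hi] is at least max(0, lo - d_c, d_c - hi) away from the query.
    bounds = sorted(
        (max(0.0, float(c["lo"]) - d_c, d_c - float(c["hi"])), j)
        for j, c in enumerate(node["children"])
    )
    for bound, j in bounds:
        if len(heap) == k and bound >= -heap[0][0]:
            break                  # this and all later groups cannot improve the result
        search(node["children"][j], X, q, k, heap)


def tknn_query(tree, X, q, k):
    """Return the k nearest neighbors of q as a sorted list of (distance, index)."""
    heap = []
    search(tree, X, q, k, heap)
    return sorted((-neg_d, i) for neg_d, i in heap)
```

A tree would be built once with `build_tree(X, np.arange(len(X)), k, L)` and reused for every query via `tknn_query(tree, X, q, k)`. Under these assumptions the pruning never discards a group that could hold a closer sample, so the result matches a brute-force scan while typically touching far fewer samples.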
Source
《河北大学学报(自然科学版)》
CAS
Peking University Core Journals (北大核心)
2008, Issue 3, pp. 322-326 (5 pages)
Journal of Hebei University (Natural Science Edition)
Keywords
KNN
text categorization
similarity