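Abstract
Standard classification test collections were used to evaluate clustering quality quantitatively, and the k-Means clustering algorithm, the STC (Suffix Tree Clustering) algorithm, and an Ant-based clustering algorithm were compared experimentally. Analysis of the results shows that the STC algorithm clusters better because it takes full account of the phrase characteristics of text when processing documents; that the results of the Ant-based clustering algorithm are strongly affected by its input parameters; and that introducing textual characteristics into the Ant clustering algorithm can improve the quality of its results.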
Textual document clustering is one of the effective approaches to establishing a classification instance of a huge textual document set. Clustering validation, or quality evaluation, techniques can be used to assess the efficiency and effectiveness of a clustering algorithm. This paper presents quality evaluation criteria and, based on these criteria, assesses three typical textual document clustering algorithms experimentally. The comparison shows that the STC (Suffix Tree Clustering) algorithm performs better than the k-Means and Ant-Based clustering algorithms. The better performance of STC stems from its taking linguistic properties into account when processing documents. The performance of the Ant-Based clustering algorithm varies with its input parameters. Adopting linguistic properties is necessary to improve the performance of Ant-Based text clustering.
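The abstract does not name the specific quality criteria used, so the following is only an illustrative sketch of how a labelled classification test set can score a clustering. It computes purity, one common external measure, chosen here as an assumption; the function name `purity` and its arguments are hypothetical and are not taken from the paper.

```python
from collections import Counter


def purity(cluster_ids, class_labels):
    """Fraction of documents that fall in the majority class of their cluster.

    cluster_ids:  cluster assigned to each document by the algorithm under test
    class_labels: reference class of each document from the labelled test set
    """
    if len(cluster_ids) != len(class_labels):
        raise ValueError("cluster_ids and class_labels must have equal length")
    if not cluster_ids:
        return 0.0

    # Count how often each reference class occurs inside each cluster.
    counts_by_cluster = {}
    for cluster, label in zip(cluster_ids, class_labels):
        counts_by_cluster.setdefault(cluster, Counter())[label] += 1

    # Each cluster contributes the size of its most frequent class.
    majority_total = sum(max(c.values()) for c in counts_by_cluster.values())
    return majority_total / len(cluster_ids)
```

A score of 1.0 means every cluster contains documents of a single reference class. Purity alone rewards many small clusters, so in practice it is usually read alongside a second measure such as an F-measure or entropy.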
Source
《中国科学院研究生院学报》
CAS
CSCD
2006, Issue 5, pp. 640-646 (7 pages)
Journal of the Graduate School of the Chinese Academy of Sciences
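Funding
Supported by the Ministry of Science and Technology of China project "国家重点实验室网上合作研究平台" (Online Collaborative Research Platform for State Key Laboratories), grant 2003DEA5G0407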