Abstract
The k-means clustering algorithm is guaranteed to converge only to a local optimum, which makes its results sensitive to the initial cluster centers. To address this problem, a text clustering algorithm based on similar centers is proposed. First, the similarity between documents is measured and the documents are sorted in descending order of similarity; the first k documents in the sequence are selected as the initial cluster centers. Each remaining document (one not selected as an initial cluster center) is then assigned to the cluster whose center it is most similar to, the cluster means are updated, and the process is iterated until the means no longer change, yielding a more stable clustering result. Experimental results show that the proposed algorithm significantly improves macro-average clustering precision and macro-average recall rate and produces clusters of better quality.
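The procedure described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation. The abstract does not say how the similarity ordering is computed, so the sketch assumes cosine similarity over term-weight vectors and ranks each document by its total similarity to all other documents. The names `cosine`, `initial_centers`, and `cluster` are hypothetical. For brevity the sketch also reassigns the seed documents along with the rest.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def initial_centers(docs, k):
    """Assumed seeding rule: rank documents by total similarity to all others, take the top k."""
    scores = [
        sum(cosine(d, e) for j, e in enumerate(docs) if j != i)
        for i, d in enumerate(docs)
    ]
    ranked = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]


def cluster(docs, k, max_iter=100):
    """Similarity-based k-means with deterministic seeding; returns one label per document."""
    centers = [list(docs[i]) for i in initial_centers(docs, k)]
    labels = []
    for _ in range(max_iter):
        # Assign every document to the center it is most similar to.
        labels = [max(range(k), key=lambda c: cosine(d, centers[c])) for d in docs]
        # Recompute each center as the mean of its members; keep the old center if empty.
        new_centers = []
        for c in range(k):
            members = [d for d, lab in zip(docs, labels) if lab == c]
            if members:
                new_centers.append([sum(col) / len(members) for col in zip(*members)])
            else:
                new_centers.append(centers[c])
        if new_centers == centers:  # means unchanged: stop iterating
            break
        centers = new_centers
    return labels
```

Because the seeds are chosen deterministically rather than at random, repeated runs on the same input give the same partition, which is the stability property the abstract emphasizes. Whether the paper's ranking rule matches the total-similarity score assumed here cannot be determined from the abstract alone.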
Source
Computer Engineering and Design (《计算机工程与设计》)
CSCD
Peking University Core Journal (北大核心)
2010, Issue 8, pp. 1802-1805 (4 pages)
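Funding
Ministry of Industry and Information Technology (MIIT) 2007 Electronic Information Industry Development Fund project (工信部运[2007]97号)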
Keywords
clustering
k-cmeans algorithm
similarity measurement
macro-average clustering precision
macro-average recall rate