Abstract
To improve clustering quality on outlier-contaminated data, a robust clustering initialization algorithm based on the uniform effect of k-Means is proposed. Because k-Means tends to produce sub-clusters with roughly equal sample counts, sub-clusters in sparse regions cover large areas, sub-clusters in dense regions cover small areas, and neighboring sub-clusters in a dense region have comparable sizes. The algorithm first partitions the dataset using k-Means with a cluster number larger than the actual number of clusters, so that these size relationships emerge. It then merges neighboring small sub-clusters to obtain the dense sample regions of the data space, which serve as the initial clusters, and discards the large, sparse sub-clusters. Outliers are distributed very sparsely, so most of them fall into large sub-clusters and have little effect on the initialization. Theoretical analysis and experiments demonstrate the effectiveness of the proposed algorithm.
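The steps described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract gives no thresholds or size measure, so the mean-distance radius, the median-based limits `size_factor` and `link_factor`, the treatment of singleton sub-clusters as outliers, and all function names below are assumptions made for illustration only.

```python
import math
import random
import statistics


def kmeans(points, k, iters=100, seed=0):
    """Plain Lloyd's k-means on a list of tuples; returns (centers, labels)."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)

    def nearest(p):
        return min(range(k), key=lambda j: math.dist(p, centers[j]))

    for _ in range(iters):
        labels = [nearest(p) for p in points]
        updated = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            # An empty sub-cluster keeps its previous center.
            updated.append(tuple(sum(c) / len(members) for c in zip(*members))
                           if members else centers[j])
        if updated == centers:
            break
        centers = updated
    return centers, [nearest(p) for p in points]


def merge_small_neighbors(centers, radii, size_limit, link_dist):
    """Group small sub-clusters whose centers lie within link_dist.

    Sub-clusters with radius above size_limit are treated as sparse and
    dropped. Returns a sorted list of groups of sub-cluster indices.
    """
    small = [i for i, r in enumerate(radii) if r <= size_limit]
    parent = {i: i for i in small}

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for a in small:
        for b in small:
            if a < b and math.dist(centers[a], centers[b]) <= link_dist:
                parent[find(a)] = find(b)

    groups = {}
    for i in small:
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


def robust_init(points, k_excess, size_factor=1.5, link_factor=3.0, seed=0):
    """Return initial cluster centers; k_excess should exceed the true k."""
    centers, labels = kmeans(points, k_excess, seed=seed)
    members = [[p for p, lab in zip(points, labels) if lab == j]
               for j in range(k_excess)]
    # Assumption: size = mean distance to the center; empty or singleton
    # sub-clusters are treated as sparse (infinite radius).
    radii = [statistics.fmean(math.dist(p, centers[j]) for p in m)
             if len(m) > 1 else math.inf
             for j, m in enumerate(members)]
    typical = statistics.median(r for r in radii if r < math.inf)
    groups = merge_small_neighbors(centers, radii,
                                   size_factor * typical,
                                   link_factor * typical)
    init = []
    for g in groups:
        pts = [p for j in g for p in members[j]]
        init.append(tuple(sum(c) / len(pts) for c in zip(*pts)))
    return init
```

Calling `robust_init(points, k_excess=12)` on a list of coordinate tuples returns one center per merged group of small sub-clusters. How well the result matches the true clusters depends heavily on the assumed factors, which would need tuning for real data.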
Source
Journal of Huazhong University of Science and Technology (Natural Science Edition) (《华中科技大学学报(自然科学版)》), 2010, No. 8, pp. 73-76 (4 pages)
Indexed in: EI, CAS, CSCD, Peking University Core Journals (北大核心)
Funding
National Natural Science Foundation of China (Grant No. 60933009)