Abstract
Traditional clustering algorithms give every attribute the same contribution when computing the distance between two objects. The COSA (Clustering On Subsets of Attributes) algorithm [1] instead assumes that an attribute may contribute differently to the distance in different groups, because objects in different groups may cluster on different subsets of attributes. On this basis, [1] defines a new distance and proposes two COSA algorithms: COSA1, a partitioning clustering algorithm, and COSA2, a hierarchical clustering algorithm. To compare the COSA distance with the traditional Euclidean distance in text clustering, this paper carries out partitioning and hierarchical clustering experiments on Chinese documents. The results show that the COSA algorithms perform better than clustering algorithms based on the Euclidean distance and are more stable when the number of attributes changes.
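The contrast between an equal-weight distance and a distance with group-specific attribute weights can be illustrated with a minimal Python sketch. This is not the COSA distance defined in [1]; it is a simplified weighted Euclidean distance, and the function names, the example vectors, and the weight values are all hypothetical, chosen only to show how weighting attributes can change which objects look close.

```python
import math


def euclidean_distance(x, y):
    """Plain Euclidean distance: every attribute contributes equally."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def weighted_distance(x, y, weights):
    """Weighted Euclidean distance (illustrative only, not the COSA definition).

    `weights` holds one non-negative weight per attribute; a weight of 0
    means the attribute is ignored for this (hypothetical) group.
    """
    if not (len(x) == len(y) == len(weights)):
        raise ValueError("x, y and weights must have the same length")
    return math.sqrt(sum(w * (a - b) ** 2 for w, a, b in zip(weights, x, y)))


# Two hypothetical documents that agree on the first two attributes
# and differ only on the third.
doc_a = [1.0, 2.0, 0.0]
doc_b = [1.0, 2.0, 3.0]

# Assumed weights for a group that clusters on the first two attributes only.
group_weights = [0.5, 0.5, 0.0]

d_equal = euclidean_distance(doc_a, doc_b)
d_weighted = weighted_distance(doc_a, doc_b, group_weights)
```

With equal weights the two documents are separated by the third attribute alone, while under the assumed group weights they coincide, which is the kind of effect that clustering on subsets of attributes relies on.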
Source
Journal of Chinese Information Processing (《中文信息学报》)
CSCD
PKU Core Journal (北大核心)
2007, No. 6, pp. 65-70 (6 pages)
Funding
National 863 Program of China (2006AA01Z142)