Abstract
To address data clustering in incomplete information systems, set pair analysis theory is introduced into k-means clustering and, to better represent the relationship between samples and clusters, a set pair k-means (SPKM) clustering algorithm for incomplete information systems is constructed. First, a set pair distance measure is proposed based on set pair theory and applied within the k-means algorithm to obtain a preliminary clustering result. Then, samples that belong to several clusters at once are assigned to the boundary regions of those clusters, while samples that belong to only one cluster are assigned to the positive region or the boundary region of that cluster. Each cluster in the result is thus represented by three parts: a positive region of samples that definitely belong to the cluster, a boundary region of samples that may belong to it, and a negative region of samples that definitely do not belong to it. Finally, six data sets from the UCI database and four comparison algorithms are used for experimental evaluation. The results show that the SPKM algorithm achieves good clustering performance in terms of accuracy, F1 score, Jaccard coefficient, FMI and ARI.
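The three-part cluster representation described in the abstract can be illustrated with a minimal sketch. The abstract does not give the set pair distance or the assignment rule, so everything below is an assumption made for illustration only: `partial_distance` is a stand-in that measures Euclidean distance over observed attributes (with `None` marking a missing value), `three_way_assign` and its `ratio` threshold are hypothetical, and the sketch simplifies the paper's rule by always placing single-cluster samples in the positive region.

```python
import math


def partial_distance(x, c):
    """Distance over observed attributes only, rescaled to full dimension.

    Stand-in for the paper's set pair distance, which the abstract
    does not specify. None marks a missing attribute value.
    """
    observed = [(xi, ci) for xi, ci in zip(x, c) if xi is not None]
    if not observed:
        return math.inf  # nothing observed: no evidence for any cluster
    sq = sum((xi - ci) ** 2 for xi, ci in observed)
    return math.sqrt(sq * len(x) / len(observed))


def three_way_assign(samples, centroids, ratio=1.3):
    """Split sample indices into positive, boundary and negative regions.

    Assumed rule: a cluster is a candidate for a sample if its distance
    is within `ratio` times the smallest distance. One candidate puts
    the sample in that cluster's positive region; several candidates
    put it in the boundary region of each of them.
    """
    k = len(centroids)
    positive = [set() for _ in range(k)]
    boundary = [set() for _ in range(k)]
    for i, x in enumerate(samples):
        dists = [partial_distance(x, c) for c in centroids]
        nearest = min(dists)
        close = [j for j, d in enumerate(dists) if d <= ratio * nearest]
        if len(close) == 1:
            positive[close[0]].add(i)
        else:
            for j in close:
                boundary[j].add(i)
    everyone = set(range(len(samples)))
    # Negative region: samples in neither the positive nor boundary region.
    negative = [everyone - positive[j] - boundary[j] for j in range(k)]
    return positive, boundary, negative


samples = [(0.0, 0.1), (5.0, None), (2.4, 2.6), (None, 0.0)]
centroids = [(0.0, 0.0), (5.0, 5.0)]
positive, boundary, negative = three_way_assign(samples, centroids)
print(positive, boundary, negative)
```

In this toy data the sample roughly equidistant from both centroids lands in both boundary regions, while the samples with a missing attribute are still assigned using whatever attributes are observed. A full algorithm would alternate this assignment step with a centroid update, as ordinary k-means does.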
Authors
张春英
高瑞艳
刘凤春
王佳昊
陈松
冯晓泽
任静
ZHANG Chunying; GAO Ruiyan; LIU Fengchun; WANG Jiahao; CHEN Song; FENG Xiaoze; REN Jing (College of Science, North China University of Science and Technology, Tangshan, 063210, China; Qian’an College, North China University of Science and Technology, Tangshan, 063210, China; Key Laboratory of Data Science and Application of Hebei Province, Tangshan, 063210, China)
Source
《数据采集与处理》
CSCD
Peking University Core Journal (北大核心)
2020, Issue 4, pp. 613-629 (17 pages)
Journal of Data Acquisition and Processing
Funding
Supported by the Natural Science Foundation of Hebei Province (F2018209374, F2016209344).