期刊文献+

不平衡数据的软子空间聚类算法 被引量:4

Soft subspace clustering algorithm for imbalanced data
下载PDF
导出
摘要 针对受均匀效应的影响,当前K-means型软子空间算法不能有效聚类不平衡数据的问题,提出一种基于划分的不平衡数据软子空间聚类新算法。首先,提出一种双加权方法,在赋予每个属性一个特征权重的同时,赋予每个簇反映其重要性的一个簇类权重;其次,提出一种混合型数据的新距离度量,以平衡不同类型属性及具有不同符号数目的类属型属性间的差异;第三,定义了基于双加权方法的不平衡数据子空间聚类目标优化函数,给出了优化簇类权重和特征权重的表达式。在实际应用数据集上进行了系列实验,结果表明,新算法使用的双权重方法能够为不平衡数据中的簇类学习更准确的软子空间;与现有的K-means型软子空间算法相比,所提算法提高了不平衡数据的聚类精度,在其中的生物信息学数据上可以取得近50%的提升幅度。 Aiming at the problem that the current K-means-type soft-subspace algorithms cannot effectively cluster imbalanced data due to uniform effect, a new partition-based algorithm was proposed for soft subspace clustering on imbalanced data. First, a bi-weighting method was proposed, where each attribute was assigned a feature-weight and each cluster was assigned a cluster-weight to measure its importance for clustering. Second, in order to make a trade-off between attributes with different types or those Categorical attributes having various numbers of categories, a new distance measurement was then proposed for mixed-type data. Third, an objective function was defined for the subspace clustering algorithm on imbalanced data based on the bi-weighting method, and the expressions for optimizing both the cluster-weights and feature-weights were derived. A series of experiments were conducted on some real-world data sets and the results demonstrated that the bi- weighting method used in the new algorithm can learn more accurate soft-subspace for the clusters hidden in the imbalanced data. Compared with the existing K-means-type soft-subspace clustering algorithms, the proposed algorithm yields higher clustering accuracy on imbalanced data, achieving about 50% improvements on the bioinformatie data used in the experiments.
出处 《计算机应用》 CSCD 北大核心 2017年第10期2952-2957,共6页 journal of Computer Applications
基金 国家自然科学基金资助项目(61672157) 福建省自然科学基金资助项目(2015J01238)~~
关键词 软子空间聚类 不平衡数据 特征权重 簇类权重 soft subspace clustering imbalanced data feature weight cluster weight
  • 相关文献

参考文献3

二级参考文献43

  • 1陈宗海,文锋,聂建斌,吴晓曙.基于节点生长k-均值聚类算法的强化学习方法[J].计算机研究与发展,2006,43(4):661-666. 被引量:13
  • 2Han Jiawei,Kamber M.Data Mining Concepts and Techniques[M].San Francisco:Morgan Kaufmann,2001. 被引量:1
  • 3Brendan J F,Delbert D.Clustering by passing messages between data points[J].Science,2007,315(16):972-976. 被引量:1
  • 4Zhang Jiangshe,Liang Yiuwing.Improved possibilistic c-means clustering algorithms[J].IEEE Trans on Fuzzy Systems,2004,12(2):209-217. 被引量:1
  • 5Mac Q J.Some methods for classification and analysis of multivariate observation[C]//Proc of the 5th Berkley Symp on Mathematical Statistics and Probability.Berkley,California:University of California Press,1967:281-297. 被引量:1
  • 6Huang Zhexue.Clustering large data sets with mixed numeric and categorical values[C]//Proc of PAKDD97.Singapore:World Scientific,1997:21-35. 被引量:1
  • 7Huang Zhexue.Extensions to the K-means algorithm for clustering large data sets with categorical values[J].Data Mining and Knowledge Discovery,1998,2(3):283-304. 被引量:1
  • 8Ng M K,Li Junjie,Huang Zhexue,et al.On the impact of dissimilarity measure in K-modes clustering algorithm[J].IEEE Trans on Pattern Analysis and Machine Intelligence,2007,29(3):503-507. 被引量:1
  • 9San O M,Huynh V N,Nakamori Y.An alternative extension of the K-means algorithm for clustering categorical data[J].Int Journal Application Mathematic and Computer Science,2004,14(2):241-247. 被引量:1
  • 10Li Cen,Biswas G.Unsupervised learning with mixed numeric and nominal data[J].IEEE Trans on Knowledge and Data Engineering,2002,14(4):673-690. 被引量:1

共引文献77

同被引文献23

引证文献4

二级引证文献11

相关作者

内容加载中请稍等...

相关机构

内容加载中请稍等...

相关主题

内容加载中请稍等...

浏览历史

内容加载中请稍等...
;
使用帮助 返回顶部