Abstract
[Objective] To evaluate the clustering performance of a gene data clustering algorithm based on minimum coding length, with the aim of providing a new method for gene data clustering. [Method] Gene data clustering was treated as the clustering of high-dimensional mixed data. The gene data were preprocessed and then reduced in dimension by principal component analysis, after which they followed an approximately Gaussian distribution. Data with this distribution can be clustered effectively by a simple clustering algorithm based on lossy data compression, which groups genes so that the total coding length of the genes after clustering is minimized. In the experiments, the new algorithm and traditional clustering algorithms were each applied to yeast and Arabidopsis gene data, and the effectiveness of the new algorithm was verified through internal evaluation and functional evaluation of the gene clusters. [Result] The validation experiments on yeast and Arabidopsis gene data showed that the new algorithm produced better clustering results than the traditional clustering algorithms, and that it avoided problems such as the need to choose the number of clusters subjectively and sensitivity to the initial cluster centers. [Conclusion] The results provide a new clustering method for gene data.
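The pipeline described in the abstract (dimension reduction by principal component analysis, followed by clustering that minimizes a lossy coding length) can be illustrated with a minimal sketch. The abstract does not state the coding length formula or the search strategy, so the sketch assumes the rate-distortion style estimate commonly used for lossy coding of Gaussian-like data and a greedy pairwise merge. The function names (`pca_reduce`, `coding_length`, `group_cost`, `cluster_min_coding_length`), the distortion parameter `eps`, and the synthetic data are all hypothetical and are not taken from the paper.

```python
import numpy as np


def pca_reduce(X, k):
    """Project the rows of X (samples x features) onto the top-k principal components."""
    Xc = X - X.mean(axis=0)
    # Rows of Vt are the principal directions of the centered data.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:k].T


def coding_length(W, eps):
    """Assumed estimate of the bits needed to code the rows of W up to distortion eps**2."""
    m, n = W.shape
    mu = W.mean(axis=0)
    Wc = W - mu
    _, logdet = np.linalg.slogdet(np.eye(n) + (n / (eps ** 2 * m)) * (Wc.T @ Wc))
    bits = 0.5 * (m + n) * logdet / np.log(2)              # centered samples
    bits += 0.5 * n * np.log2(1.0 + (mu @ mu) / eps ** 2)  # group mean
    return bits


def group_cost(X, idx, eps):
    """Coding length of one group plus the bits that label its members."""
    m_total = X.shape[0]
    return coding_length(X[idx], eps) - len(idx) * np.log2(len(idx) / m_total)


def cluster_min_coding_length(X, eps):
    """Greedy merge: start from singletons, merge the pair that saves the most bits."""
    groups = [[i] for i in range(X.shape[0])]
    costs = [group_cost(X, g, eps) for g in groups]
    while len(groups) > 1:
        best_delta, best_pair, best_cost = 0.0, None, None
        for a in range(len(groups)):
            for b in range(a + 1, len(groups)):
                merged_cost = group_cost(X, groups[a] + groups[b], eps)
                delta = merged_cost - costs[a] - costs[b]
                if delta < best_delta:
                    best_delta, best_pair, best_cost = delta, (a, b), merged_cost
        if best_pair is None:  # no merge shortens the total code: stop
            break
        a, b = best_pair
        groups[a] = groups[a] + groups[b]
        costs[a] = best_cost
        del groups[b], costs[b]
    return groups


# Synthetic stand-in for preprocessed expression profiles (not real gene data):
# two tight groups of 10 samples in 5 dimensions, far apart along the first axis.
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(0.0, 0.1, (10, 5)),
    rng.normal(0.0, 0.1, (10, 5)) + [100, 0, 0, 0, 0],
])
Z = pca_reduce(X, 2)
groups = cluster_min_coding_length(Z, eps=1.0)
print(len(groups), "clusters found")
```

In this sketch the number of clusters is not supplied by the user and no initial cluster centers are chosen, which mirrors the two advantages claimed in the abstract. The result does depend on the assumed distortion `eps`, and the pairwise search is quadratic in the number of groups per merge, so it is only practical for small inputs.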
Source
Journal of Anhui Agricultural Sciences (《安徽农业科学》)
CAS
2012, Issue 19, pp. 10003-10005, 10072 (4 pages)
Keywords
Gene clustering
Lossy compression
Gaussian distribution
Minimum coding length