Abstract
When selecting single nucleotide polymorphisms (SNPs) from high-dimensional genetic data with few samples, the selected SNP subset should represent the information in the full SNP set while reducing dimensionality. To this end, an improved method named GN-FCM is proposed on the basis of the Fuzzy C-Means (FCM) algorithm. A SNP weight factor is introduced to quantify differences in the importance of SNP sites, and a neighborhood regularization term for key SNPs is added to the loss function of fuzzy clustering to mine the correlation between highly important SNPs and the other SNPs in the same neighborhood. Experimental results show that GN-FCM converges well. Compared with the DW-FCM algorithm, the SNP subsets it constructs improve accuracy by 5.73%, 3.40% and 3.79% in Support Vector Machine (SVM), Decision Tree (DT) and Naïve Bayesian (NB) classification, respectively, and improve the F1 value by 4.01%, 3.20% and 2.22%, respectively.
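To make the weighting idea in the abstract concrete, the following is a minimal sketch of fuzzy C-means in which each clustered object (here, one SNP encoded as a genotype vector) carries a fixed importance weight. It is not the GN-FCM algorithm itself: the abstract does not give the form of the neighborhood regularization term or say how the weight factors are computed, so the regularizer is omitted and the weights are simply supplied as inputs. The function name `weighted_fcm`, its parameters, and the toy data are all assumptions made for illustration.

```python
import random


def weighted_fcm(points, weights, n_clusters, m=2.0, max_iter=100, tol=1e-6, seed=0):
    """Fuzzy C-means with a fixed per-object weight (illustrative helper).

    points  : list of equal-length numeric vectors, one per SNP
    weights : list of positive importance weights, one per point (assumed given)
    Returns (centers, memberships); memberships[k][i] is the degree to which
    point k belongs to cluster i.
    """
    rng = random.Random(seed)
    n, dim = len(points), len(points[0])

    # Random initial memberships, each row normalised to sum to 1.
    u = []
    for _ in range(n):
        row = [rng.random() + 1e-9 for _ in range(n_clusters)]
        total = sum(row)
        u.append([r / total for r in row])

    centers = []
    for _ in range(max_iter):
        # Center update: weighted mean with coefficients w_k * u_ki^m.
        centers = []
        for i in range(n_clusters):
            coef = [weights[k] * u[k][i] ** m for k in range(n)]
            total = sum(coef)
            centers.append(
                [sum(coef[k] * points[k][d] for k in range(n)) / total for d in range(dim)]
            )

        # Membership update: the standard FCM rule on squared distances.
        # A per-object weight multiplies every term of that object's
        # constraint equally, so it does not change this step.
        new_u = []
        for k in range(n):
            d2 = [sum((points[k][d] - c[d]) ** 2 for d in range(dim)) for c in centers]
            if min(d2) == 0.0:
                # The point coincides with one or more centers.
                hits = [1.0 if v == 0.0 else 0.0 for v in d2]
                total = sum(hits)
                new_u.append([h / total for h in hits])
            else:
                inv = [v ** (-1.0 / (m - 1.0)) for v in d2]
                total = sum(inv)
                new_u.append([v / total for v in inv])

        change = max(abs(new_u[k][i] - u[k][i]) for k in range(n) for i in range(n_clusters))
        u = new_u
        if change < tol:
            break
    return centers, u


# Toy data (made up): four SNPs, each a genotype vector (0/1/2) over four samples.
snps = [[0, 0, 1, 0], [0, 1, 1, 0], [2, 2, 1, 2], [2, 1, 2, 2]]
snp_weights = [1.0, 1.0, 2.0, 1.0]  # hypothetical importance weights
centers, u = weighted_fcm(snps, snp_weights, n_clusters=2)

# Hard label per SNP: the cluster with the largest membership.
labels = [max(range(2), key=lambda i: row[i]) for row in u]
```

One plausible way to turn such a clustering into a SNP subset is to keep, for each cluster, the SNP with the highest membership in that cluster as its representative; whether GN-FCM selects representatives this way is not stated in the abstract.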
Authors
张波
周从华
张付全
张婷
蒋跃明
ZHANG Bo; ZHOU Conghua; ZHANG Fuquan; ZHANG Ting; JIANG Yueming (School of Computer Science and Communication Engineering, Jiangsu University, Zhenjiang, Jiangsu 212013, China; Wuxi Mental Health Center, Wuxi, Jiangsu 214151, China; Wuxi Maternity and Child Health Care Hospital, Wuxi, Jiangsu 214002, China; Wuxi No.5 People's Hospital, Wuxi, Jiangsu 214073, China)
Source
Computer Engineering (《计算机工程》)
CAS
CSCD
Peking University Core Journal
2019, No. 8, pp. 66-74 (9 pages)
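Funding
Social Development Project of the Jiangsu Provincial Key Research and Development Program (BE2016630, BE2017628)
Scientific Research Project of the Wuxi Municipal Health and Family Planning Commission (Z201603)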
Keywords
Single Nucleotide Polymorphism (SNP) selection
fuzzy clustering
feature selection
Support Vector Machine (SVM)
Decision Tree (DT)
Naïve Bayesian (NB) classification