A Classification Method for Class-Imbalanced Data and Its Application on Bioinformatics (cited by: 26)
Abstract A classification method is proposed for class-imbalanced data, which is common in bioinformatics tasks such as identifying snoRNA, distinguishing real microRNA precursors from pseudo ones, and mining true SNPs from EST sequences. The method builds on ensemble learning. First, the large (negative) class is randomly divided into equal-sized subsets, and each subset is combined with the small (positive) class to form a class-balanced training set. Classifiers based on different principles are then trained on these balanced sets, and the trained classifiers vote on the final prediction for each new sample. To keep weak classifiers from degrading the vote, training uses a strategy similar to AdaBoost: the samples that a classifier misclassifies are added to the training sets of the next two classifiers. The training sets are modified repeatedly until a classifier accurately predicts its training set or a maximum number of repetitions is reached. The strategy is intended to avoid AdaBoost's repeated training while using the vote to curb the influence of weak classifiers. Experiments on five UCI data sets and on the three bioinformatics tasks above demonstrate the method's advantage on class-imbalanced classification problems. In addition, a software program named LibID, which can be used in much the same way as LibSVM, has been developed for researchers in bioinformatics and other fields.
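The training and voting scheme described in the abstract can be illustrated with a minimal sketch. It is not the authors' LibID implementation: the two toy classifiers (`NearestCentroid`, `OneNearestNeighbor`), the helper names, and the toy data are all assumptions made for illustration, and the sketch makes a single pass over the training sets rather than repeating until a classifier fits its training set.

```python
import random
from collections import Counter

def sq_dist(u, v):
    """Squared Euclidean distance between two feature tuples."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

class NearestCentroid:
    """Toy classifier (assumption): predict the class with the closest mean."""
    def fit(self, X, y):
        self.centroids = {}
        for label in set(y):
            rows = [x for x, t in zip(X, y) if t == label]
            self.centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]
        return self
    def predict_one(self, x):
        return min(self.centroids, key=lambda c: sq_dist(x, self.centroids[c]))

class OneNearestNeighbor:
    """Toy classifier (assumption): predict the label of the closest sample."""
    def fit(self, X, y):
        self.X, self.y = list(X), list(y)
        return self
    def predict_one(self, x):
        i = min(range(len(self.X)), key=lambda k: sq_dist(x, self.X[k]))
        return self.y[i]

def split_negatives(negatives, n_pos, seed=0):
    """Shuffle the large class and cut it into subsets of about n_pos samples."""
    rng = random.Random(seed)
    shuffled = list(negatives)
    rng.shuffle(shuffled)
    k = max(1, len(shuffled) // n_pos)
    return [shuffled[i::k] for i in range(k)]

def train_ensemble(positives, negatives, makers, seed=0):
    """Train one classifier per balanced set; pass errors to the next two sets."""
    chunks = split_negatives(negatives, len(positives), seed)
    train_sets = [[(x, 1) for x in positives] + [(x, 0) for x in chunk]
                  for chunk in chunks]
    models = []
    for i, data in enumerate(train_sets):
        X, y = zip(*data)
        # Alternate between classifier types of different principles.
        model = makers[i % len(makers)]().fit(X, y)
        models.append(model)
        missed = [(x, t) for x, t in data if model.predict_one(x) != t]
        for j in (i + 1, i + 2):  # the next two classifiers' training sets
            if j < len(train_sets):
                train_sets[j].extend(missed)
    return models

def vote(models, x):
    """Majority vote over all trained classifiers (ties go to the first seen)."""
    return Counter(m.predict_one(x) for m in models).most_common(1)[0][0]

# Toy imbalanced data (assumption): 5 positives versus 25 negatives.
positives = [(5.0 + 0.1 * i, 5.0) for i in range(5)]
negatives = [(0.1 * i, 0.1 * j) for i in range(5) for j in range(5)]
models = train_ensemble(positives, negatives, [NearestCentroid, OneNearestNeighbor])
print(len(models), vote(models, (5.2, 4.9)), vote(models, (0.2, 0.1)))
```

Because every balanced set contains the whole positive class, no positive sample is discarded, while each classifier sees only a fraction of the negatives; the vote then combines what the individual classifiers learned from different negative subsets.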
Source Journal of Computer Research and Development (《计算机研究与发展》), indexed in EI, CSCD, and the Peking University Core Journals list, 2010, No. 8, pp. 1407-1414 (8 pages)
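Funding National Natural Science Foundation of China (60741001, 60871092, 60932008); Heilongjiang Province Science Fund for Distinguished Young Scholars (JC200611); Key Program of the Natural Science Foundation of Heilongjiang Province (ZJG0705)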
Keywords bioinformatics; class imbalance; ncRNA identification; mining SNP from EST; classification
