Abstract
Gene data are high-dimensional, small-sample, and noisy, which makes them prone to the curse of dimensionality and to overfitting during processing. To address this, a new feature selection method for gene datasets is proposed. First, the ReliefF algorithm screens the gene features by weight (importance). Second, the mRMR algorithm is applied to the screened feature set, retaining the gene features that are highly correlated with the target class but only weakly correlated with one another. Third, a neighborhood rough set feature selection algorithm searches the reduced gene dataset for an optimal subset of feature genes. To verify the effectiveness of the new method, an SVM is used as the classifier and the whole procedure is evaluated with external cross-validation.
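The three stages described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes two classes labeled 0 and 1, uses a simplified Relief-style weight rather than full ReliefF, and substitutes absolute Pearson correlation for the mutual information that mRMR normally uses. All function names and the parameters `n_relief`, `n_mrmr`, and `delta` are hypothetical.

```python
import math
import random

def relief_weights(X, y, n_neighbors=3):
    """Relief-style weights for features already scaled to [0, 1] (simplified, not full ReliefF)."""
    n, d = len(X), len(X[0])
    w = [0.0] * d
    for i in range(n):
        # Rank the other samples by Manhattan distance to sample i.
        order = sorted((k for k in range(n) if k != i),
                       key=lambda k: sum(abs(X[i][j] - X[k][j]) for j in range(d)))
        hits = [k for k in order if y[k] == y[i]][:n_neighbors]
        misses = [k for k in order if y[k] != y[i]][:n_neighbors]
        for j in range(d):
            # Penalize differences to same-class neighbors, reward differences to other-class ones.
            if hits:
                w[j] -= sum(abs(X[i][j] - X[k][j]) for k in hits) / (len(hits) * n)
            if misses:
                w[j] += sum(abs(X[i][j] - X[k][j]) for k in misses) / (len(misses) * n)
    return w

def abs_corr(a, b):
    """Absolute Pearson correlation (assumed stand-in for mutual information)."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((p - ma) * (q - mb) for p, q in zip(a, b))
    va = sum((p - ma) ** 2 for p in a)
    vb = sum((q - mb) ** 2 for q in b)
    return abs(cov) / math.sqrt(va * vb) if va > 0 and vb > 0 else 0.0

def mrmr_select(X, y, candidates, k):
    """Greedy max-relevance, min-redundancy selection of up to k candidates."""
    col = lambda j: [row[j] for row in X]
    relevance = {j: abs_corr(col(j), y) for j in candidates}
    selected, remaining = [], list(candidates)
    while remaining and len(selected) < k:
        def score(j):
            if not selected:
                return relevance[j]
            redundancy = sum(abs_corr(col(j), col(s)) for s in selected) / len(selected)
            return relevance[j] - redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected

def dependency(X, y, feats, delta):
    """Fraction of samples whose delta-neighborhood (Euclidean, over feats) is label-pure."""
    n = len(X)
    pos = 0
    for i in range(n):
        pos += all(y[k] == y[i] for k in range(n)
                   if math.sqrt(sum((X[i][j] - X[k][j]) ** 2 for j in feats)) <= delta)
    return pos / n

def nrs_reduce(X, y, candidates, delta=0.2):
    """Forward greedy reduction: add the feature that raises dependency most, stop when none does."""
    reduct, best, remaining = [], 0.0, list(candidates)
    while remaining:
        gains = {j: dependency(X, y, reduct + [j], delta) for j in remaining}
        j = max(remaining, key=gains.get)
        if gains[j] <= best:
            break
        reduct.append(j)
        best = gains[j]
        remaining.remove(j)
    return reduct

def select_genes(X, y, n_relief=10, n_mrmr=5, delta=0.2):
    """Relief-style filter, then mRMR-style filter, then neighborhood rough set reduction."""
    d = len(X[0])
    lo = [min(row[j] for row in X) for j in range(d)]
    hi = [max(row[j] for row in X) for j in range(d)]
    # Min-max scale each feature so that delta is comparable across features.
    Xn = [[(row[j] - lo[j]) / ((hi[j] - lo[j]) or 1.0) for j in range(d)] for row in X]
    w = relief_weights(Xn, y)
    top = sorted(range(d), key=lambda j: w[j], reverse=True)[:n_relief]
    kept = mrmr_select(Xn, y, top, n_mrmr)
    return nrs_reduce(Xn, y, kept, delta)

# Toy data: feature 0 tracks the label, the other seven features are noise.
random.seed(0)
labels = [i % 2 for i in range(40)]
data = [[c + random.uniform(-0.1, 0.1)] + [random.random() for _ in range(7)] for c in labels]
print(select_genes(data, labels, n_relief=5, n_mrmr=3))
```

In an external cross-validation setup, a function like `select_genes` would be run on the training portion of each fold only, with the classifier then trained and tested on the selected features, so that the test samples never influence which genes are chosen.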
Authors
马国娟
吴辰文
刘文祎
MA Guo-juan; WU Chen-wen; LIU Wen-yi (School of Electronics and Information, Lanzhou Jiaotong University, Lanzhou 730070, China)
Source
《测控技术》
2019, No. 10, pp. 71-75 (5 pages)
Measurement & Control Technology
Funding
National Natural Science Foundation of China (61163010)
Lanzhou Science and Technology Plan Project (2015-2-99)