Abstract
A feature selection method based on F-score and partial least squares discriminant analysis (PLS-DA) is proposed and applied to proteomic mass spectrometry (MS) data analysis and potential biomarker discovery. The method comprises three steps: (1) preprocessing the raw spectra with the LIMPIC algorithm; (2) calculating the F-score of each variable and sorting all variables in descending order of F-score; and (3) selecting the optimal variable subset by forward selection, evaluated with PLS-DA cross-validation. The method was applied to a colorectal cancer dataset, and 8 characteristic m/z values were selected from the original 16331 m/z variables as potential biomarkers. Using the selected features to classify the samples of an independent test set gave a sensitivity of 95.24% and a specificity of 100%. The results indicate that the proposed method can be used for feature selection and sample classification of proteomic MS data.
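Steps (2) and (3) can be illustrated with a minimal sketch. Several things here are assumptions rather than details from the abstract: the F-score is taken in its commonly used two-class form (squared deviations of the class means from the overall mean, divided by the sum of the class sample variances); forward selection is read as adding variables in F-score rank order and keeping the prefix with the best cross-validated accuracy; and a leave-one-out nearest-centroid classifier stands in for PLS-DA so that the example needs only the standard library. All function names and the toy data are hypothetical.

```python
from statistics import mean, variance

def f_score(pos, neg):
    """F-score of one variable (assumed two-class form, not taken from the paper)."""
    m_all = mean(pos + neg)
    between = (mean(pos) - m_all) ** 2 + (mean(neg) - m_all) ** 2
    within = variance(pos) + variance(neg)  # sample variances (n - 1 denominator)
    if within == 0:
        return 0.0 if between == 0 else float("inf")
    return between / within

def rank_by_f_score(X, y):
    """Column indices of X in descending F-score order; y holds 1 (disease) or 0 (control)."""
    scores = []
    for j in range(len(X[0])):
        pos = [row[j] for row, label in zip(X, y) if label == 1]
        neg = [row[j] for row, label in zip(X, y) if label == 0]
        scores.append(f_score(pos, neg))
    return sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)

def loo_accuracy(X, y, cols):
    """Leave-one-out accuracy of a nearest-centroid classifier (stand-in for PLS-DA)."""
    hits = 0
    for i in range(len(X)):
        dist = {}
        for label in (0, 1):
            rows = [X[k] for k in range(len(X)) if k != i and y[k] == label]
            centroid = [mean(r[j] for r in rows) for j in cols]
            dist[label] = sum((X[i][j] - c) ** 2 for j, c in zip(cols, centroid))
        hits += min(dist, key=dist.get) == y[i]
    return hits / len(X)

def forward_select(X, y, max_features=None):
    """Grow the subset in F-score order; keep the prefix with the best CV accuracy."""
    ranked = rank_by_f_score(X, y)
    best_cols, best_acc = [], -1.0
    for k in range(1, (max_features or len(ranked)) + 1):
        acc = loo_accuracy(X, y, ranked[:k])
        if acc > best_acc:  # strict improvement, so smaller subsets win ties
            best_cols, best_acc = ranked[:k], acc
    return best_cols, best_acc

# Hypothetical toy data: 6 samples x 3 variables; only variable 0 separates the classes.
X = [
    [1.0, 5.0, 0.3],
    [1.2, 5.1, 0.9],
    [0.9, 4.9, 0.1],
    [3.0, 5.0, 0.8],
    [3.1, 5.2, 0.2],
    [2.9, 4.8, 0.6],
]
y = [0, 0, 0, 1, 1, 1]

ranked = rank_by_f_score(X, y)
selected, accuracy = forward_select(X, y)
print(ranked, selected, accuracy)
```

Keeping the smallest subset on ties is a design choice made for this sketch; it favors a short biomarker list, in the spirit of reducing 16331 variables to 8. In practice the stand-in classifier would be replaced by a PLS-DA model evaluated with the same cross-validation loop.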
Source
Computers and Applied Chemistry (《计算机与应用化学》)
Indexed in: CAS, CSCD, Peking University Core Journals (北大核心)
2012, Issue 12, pp. 1467-1470 (4 pages)
Keywords
feature selection, mass spectra, F-score, partial least squares discriminant analysis (PLS-DA)