Abstract
Pattern recognition on protein mass spectrometry data has become a new approach to cancer diagnosis. However, such data are high-dimensional with small sample sizes, which poses a major challenge for data analysis. After preprocessing the raw data with baseline correction and normalization, and reducing its dimensionality by binning, we propose using the t-test to select features for the analysis of protein mass spectrometry data. In the experiment, an ovarian mass spectrometry dataset is classified, with 10-fold cross-validation used to choose the training and test samples and a support vector machine (SVM) as the classifier. The results show that the proposed method both selects a small feature subset and achieves a high recognition rate: its sensitivity, specificity, and overall recognition rate reach 100%, 96.7%, and 98.8%, respectively.
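The binning and t-test feature selection steps named in the abstract can be sketched roughly as follows. This is a minimal illustration on made-up toy spectra, not the authors' implementation: the bin width, the use of the bin mean, the Welch form of the t statistic, and all function names are assumptions, and baseline correction, normalization, cross-validation, and the SVM classifier are omitted.

```python
import math
from statistics import mean, variance


def bin_spectrum(intensities, bin_size):
    """Reduce dimensionality by averaging consecutive intensities in fixed-width bins.

    Using the bin mean is an assumption; the abstract does not say how bins are summarized.
    """
    return [mean(intensities[i:i + bin_size])
            for i in range(0, len(intensities), bin_size)]


def t_score(a, b):
    """Welch two-sample t statistic for one feature (assumed variant of the t-test)."""
    diff = mean(a) - mean(b)
    se = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    if se == 0:
        # No spread in either group: the feature is either useless or a perfect separator.
        return 0.0 if diff == 0 else math.copysign(math.inf, diff)
    return diff / se


def select_features(group_a, group_b, k):
    """Return the indices of the k features with the largest absolute t statistic."""
    n_features = len(group_a[0])
    scores = [abs(t_score([s[j] for s in group_a], [s[j] for s in group_b]))
              for j in range(n_features)]
    return sorted(range(n_features), key=lambda j: scores[j], reverse=True)[:k]


# Toy spectra (hypothetical values): six intensity points per sample.
cancer_raw = [
    [1.0, 1.2, 5.0, 5.2, 2.0, 2.1],
    [0.9, 1.1, 5.4, 5.0, 2.2, 1.9],
    [1.1, 1.0, 4.8, 5.1, 1.8, 2.0],
]
control_raw = [
    [1.0, 1.1, 2.0, 2.2, 2.1, 2.0],
    [1.2, 0.9, 2.3, 1.9, 1.9, 2.2],
    [0.8, 1.2, 2.1, 2.0, 2.0, 1.8],
]

# Bin width 2 turns each 6-point spectrum into 3 features.
binned_cancer = [bin_spectrum(s, 2) for s in cancer_raw]
binned_control = [bin_spectrum(s, 2) for s in control_raw]

# Keep only the single most discriminative bin.
top = select_features(binned_cancer, binned_control, 1)
print(top)
```

In this toy data only the middle bin differs between the two groups, so it is the one that gets selected; the retained features would then be passed to a classifier such as an SVM.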
Source
《淮阴师范学院学报(自然科学版)》
CAS
2011, Issue 5, pp. 409-413 (5 pages)
Journal of Huaiyin Teachers College (Natural Science Edition)
Keywords
protein mass spectrometry
binning
t-test
support vector machine