Background: The major difficulty in analyzing DNA microarray data is the large number of genes relative to the small number of samples, together with the complex data structure. Random forest has received much attention recently; its primary strength is that it can build a classification model from high-dimensional data. However, it does not yield optimal results for gene selection, because it is still affected by undifferentiated (non-discriminating) genes. We propose recursive random forest analysis and apply it to gene selection. Methods: Recursive random forest, an improvement on random forest, obtains the optimal set of differentiated genes by dropping, step by step, the genes that a given algorithm finds to have no effect on classification. The method retains the advantages of random forest and also provides a gene importance scale. The area under the receiver operating characteristic (ROC) curve (AUC), which combines information on sensitivity and specificity, is adopted as the key criterion for evaluating the method. The paper focuses on validating gene selection by recursive random forest through the analysis of five microarray datasets: colon, prostate, leukemia, breast, and skin. Results: Five microarray datasets were analyzed, and better classification results were attained using only a few genes after gene selection. The biological information of the genes selected from the breast and skin data was confirmed against the National Center for Biotechnology Information (NCBI). The results indicate that disease-associated genes can be effectively retained by recursive random forest. Conclusions: Recursive random forest can be effectively applied to microarray data analysis and gene selection. The genes retained in the optimal model provide important information for clinical diagnosis and for research into the biological mechanisms of disease.
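Illustrative note: the abstract above uses the AUC as its evaluation criterion because it summarizes sensitivity and specificity over all decision thresholds. The following is a minimal sketch of that quantity only, not of the recursive random forest itself. The function names `auc` and `sensitivity_specificity` and the 0/1 label coding are assumptions made for this example.

```python
def sensitivity_specificity(labels, scores, threshold):
    """Sensitivity and specificity when scores >= threshold are called positive.

    labels: 0/1 class labels (1 = diseased); scores: classifier outputs.
    Assumes both classes are present in `labels`.
    """
    tp = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= threshold)
    fn = sum(1 for y, s in zip(labels, scores) if y == 1 and s < threshold)
    tn = sum(1 for y, s in zip(labels, scores) if y == 0 and s < threshold)
    fp = sum(1 for y, s in zip(labels, scores) if y == 0 and s >= threshold)
    return tp / (tp + fn), tn / (tn + fp)


def auc(labels, scores):
    """Area under the ROC curve via the rank-sum (Mann-Whitney) identity.

    The AUC equals the probability that a randomly chosen positive sample
    scores higher than a randomly chosen negative one (ties count 1/2).
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    y = [0, 0, 1, 1]
    s = [0.1, 0.4, 0.35, 0.8]
    print(auc(y, s))
    print(sensitivity_specificity(y, s, 0.5))
```

The pairwise form is quadratic in the number of samples, which is acceptable for microarray studies where samples number in the tens or hundreds.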
Microarray data are often extremely asymmetric in dimensionality, with thousands or even tens of thousands of genes but only a few hundred samples or fewer. Such extreme asymmetry between the number of genes and the number of samples can lead to inaccurate clinical diagnosis of disease. It has been shown that selecting a small set of marker genes can improve classification accuracy. In this paper, a simple modified ant colony optimization (ACO) algorithm is proposed to select tumor-related marker genes, and a support vector machine (SVM) is used as the classifier to evaluate the performance of the extracted gene subset. Experimental results on several benchmark tumor microarray datasets show that the proposed approach achieves better recognition with fewer marker genes than many other methods. The modified ACO is thus a useful tool for selecting marker genes and mining high-dimensional data.
Understanding how human cardiomyocytes mature is crucial to realizing stem cell-based heart regeneration, modeling adult heart diseases, and facilitating drug discovery. However, it is not feasible to analyze maturation in human samples, because samples are inaccessible during fetal development and childhood, when cardiomyocytes mature, and because variation among individuals is difficult to avoid. Using model animals such as mice can be a useful strategy; nonetheless, it is not well understood whether, and to what degree, gene expression profiles during maturation are shared between humans and mice. Therefore, we performed a comparative gene expression analysis of mouse and human samples. First, we examined two distinct mouse microarray platforms for shared gene expression profiles, aiming to increase the reliability of the analysis. Using principal component analysis, we identified a set of genes displaying progressive changes during maturation. Second, we demonstrated that the identified genes had a differential expression pattern between adult and earlier stages (e.g., fetus) that is common to mice and humans. Our findings provide a foundation for further genetic studies of cardiomyocyte maturation.
Acute leukemia is an aggressive disease with high mortality rates worldwide. The error rate can be as high as 40% when classifying acute leukemia into its subtypes, so there is an urgent need to support hematologists during the classification process. More than two decades ago, researchers used microarray gene expression data to classify cancer and adopted acute leukemia as a test case. The high classification accuracy they achieved confirmed that it is possible to classify cancer subtypes using microarray gene expression data. Ensemble machine learning is an effective method that combines individual classifiers to classify new samples. Ensemble classifiers are recognized as powerful algorithms with numerous advantages over traditional classifiers. Over the past few decades, researchers have paid a great deal of attention to ensemble classifiers in a wide variety of fields, including but not limited to disease diagnosis, finance, bioinformatics, healthcare, manufacturing, and geography. This paper reviews recent ensemble classifier approaches used for acute leukemia gene expression data classification. Moreover, a framework for classifying acute leukemia gene expression data is proposed, which uses both the pairwise correlation gene selection method and the Rotation Forest of Bayesian Networks. Experimental outcomes show that the acute leukemia ensemble classifiers constructed according to the suggested framework achieve good classification accuracy compared with that reported in other studies.
This paper is devoted to identifying the biomarkers of rat liver regeneration via adaptive logistic regression. By combining the adaptive elastic net penalty with the logistic regression loss, adaptive logistic regression is proposed to adaptively identify the important genes in groups. Furthermore, by improving the pathwise coordinate descent algorithm, a fast solver is developed for computing the regularization paths of the adaptive logistic regression. Results from experiments on microarray data of rat liver regeneration illustrate the effectiveness of the proposed method and verify the biological rationality of the selected biomarkers.
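Illustrative note: the abstract above does not give its objective function. As a sketch under stated assumptions, a generic adaptive elastic net penalized logistic regression can be written as follows, where the weights $w_j$ derived from an initial estimate $\tilde\beta_j$, the exponent $\gamma$, and the tuning parameters $\lambda_1, \lambda_2$ follow the standard adaptive elastic net form and are not taken from the paper:

```latex
\[
(\hat\beta_0, \hat\beta)
  = \arg\min_{\beta_0,\,\beta}
    \Bigl\{
      -\frac{1}{n}\sum_{i=1}^{n}
        \Bigl[\, y_i\,(\beta_0 + x_i^{\top}\beta)
          - \log\bigl(1 + e^{\beta_0 + x_i^{\top}\beta}\bigr) \Bigr]
      + \lambda_1 \sum_{j=1}^{p} w_j\,\lvert \beta_j \rvert
      + \lambda_2 \sum_{j=1}^{p} \beta_j^{2}
    \Bigr\},
\qquad
w_j = \lvert \tilde\beta_j \rvert^{-\gamma},\ \gamma > 0 .
\]
```

Here $y_i \in \{0,1\}$ is the class label of sample $i$ and $x_i$ its expression vector over $p$ genes. The weighted $\ell_1$ term drives the coefficients of uninformative genes to zero, while the $\ell_2$ term tends to keep correlated genes together, which is the usual motivation for selecting genes in groups.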
Microarray-based tumor diagnosis is an active topic in bioinformatics. One of the key problems is the discovery and analysis of the informative genes of a tumor. Although there are many elaborate approaches to this problem, it is still difficult to select a reasonable set of informative genes for tumor diagnosis from microarray data alone. In this paper, we group the genes measured by microarray into a number of clusters via the distance sensitive rival penalized competitive learning (DSRPCL) algorithm and then detect the informative gene cluster or set with the help of a support vector machine (SVM). Moreover, the critical or most powerful informative genes can be found through further classification and detection on the obtained informative gene clusters. Experiments on the colon, leukemia, and breast cancer datasets demonstrate that the proposed DSRPCL-SVM approach leads to a reasonable selection of informative genes for tumor diagnosis.
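Objective: To modify significance analysis of microarray (SAM) using a Gaussian kernel function and a Euclidean distance function, yielding the MSAM1 (modified significance analysis of microarray-1) and MSAM2 (modified significance analysis of microarray-2) methods, and to compare them with SAM, Relief, and support vector machine recursive feature elimination (SVM-RFE) in order to evaluate the gene selection and classification prediction ability of MSAM1 and MSAM2 on gene expression data. Methods: The leukemia dataset was obtained from the golubEsets package in Bioconductor (Golub et al. reported 50 differentially expressed genes contained in this dataset). The five algorithms were implemented in R. Gene selection ability was evaluated by accuracy, and classification prediction ability by the area under the ROC curve (AUC). The Kruskal-Wallis H test was used to compare between-group differences in accuracy and AUC among the five methods, and the SNK-q test was used for further pairwise comparisons. Results: For both accuracy and AUC, MSAM1 and MSAM2 performed best, followed by SAM and SVM-RFE, with Relief last. The differences among the five methods were statistically significant (H=150.333, P<0.0001 and H=293.2579, P<0.0001). Pairwise comparisons showed that although the difference between MSAM1 and MSAM2 was not statistically significant (P>0.05), both methods differed significantly from the other three (P<0.05). Conclusion: The weighted SAM methods modified with the Gaussian kernel function and the Euclidean distance function improve the gene selection and classification prediction ability of SAM and can yield more stable results when applied to real gene expression data.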
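Illustrative note: the abstract above does not specify how the Gaussian kernel and Euclidean distance weights enter MSAM1 and MSAM2, so the sketch below covers only the unmodified two-class SAM statistic that those methods start from, d = (mean_a - mean_b) / (s + s0). The function names, the dictionary layout of the expression data, and the default `s0=0.1` are assumptions made for this example.

```python
import math


def sam_d(group_a, group_b, s0=0.0):
    """Two-class SAM relative difference for a single gene.

    group_a, group_b: expression values of the gene in each class.
    s0: small positive "fudge" constant added to the denominator so that
        genes with very low variance do not dominate the ranking.
    """
    n1, n2 = len(group_a), len(group_b)
    m1 = sum(group_a) / n1
    m2 = sum(group_b) / n2
    # Pooled within-class sum of squares.
    ss = sum((x - m1) ** 2 for x in group_a) + sum((x - m2) ** 2 for x in group_b)
    s = math.sqrt((1.0 / n1 + 1.0 / n2) * ss / (n1 + n2 - 2))
    if s + s0 == 0:
        raise ValueError("zero denominator: use s0 > 0 for constant genes")
    return (m1 - m2) / (s + s0)


def rank_genes(expr, labels, s0=0.1):
    """Rank genes by |d|, largest first.

    expr: dict mapping gene name -> list of expression values (one per sample).
    labels: list of 0/1 class labels aligned with the samples.
    Returns a list of (gene, d) pairs.
    """
    scored = []
    for gene, values in expr.items():
        a = [v for v, y in zip(values, labels) if y == 1]
        b = [v for v, y in zip(values, labels) if y == 0]
        scored.append((gene, sam_d(a, b, s0)))
    return sorted(scored, key=lambda item: abs(item[1]), reverse=True)


if __name__ == "__main__":
    expr = {"g1": [1, 2, 3, 4, 5, 6], "g2": [1, 5, 2, 6, 3, 4]}
    labels = [0, 0, 0, 1, 1, 1]
    for gene, d in rank_genes(expr, labels):
        print(gene, round(d, 3))
```

In the full SAM procedure the significance of each d value is then assessed by permuting the class labels; that step is omitted here.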
A gene selection algorithm was developed using Multiple Principal Component Analysis with Sparsity (MSPCA). The MSPCA algorithm analyzes normal and disease gene expression samples and, to obtain sparse solutions, sets component loadings to zero if they are smaller than a threshold. Next, genes with zero loadings across all samples (both normal and disease) are removed before feature genes are extracted. Feature genes are genes that contribute differentially to the variation in normal and disease samples and can therefore be used for classification. MSPCA is applied to three microarray datasets to select feature genes, and a linear support vector machine is used to evaluate its performance. A comparison with several previously reported gene selection results shows that the MSPCA gene selection algorithm has good classification accuracy and model stability.
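Illustrative note: the thresholding and gene-removal steps described above can be sketched as follows. This is a minimal sketch, not the MSPCA implementation: it assumes the loadings have already been computed by separate PCAs on the normal and disease samples, that "smaller than a threshold" refers to the absolute value of a loading, and that loadings are stored as one row per gene and one column per component. The names `sparsify` and `kept_genes` are hypothetical.

```python
def sparsify(loadings, threshold):
    """Zero out every loading whose magnitude is below `threshold`.

    loadings: list of rows, one row per gene, one column per component.
    """
    return [[v if abs(v) >= threshold else 0.0 for v in row] for row in loadings]


def kept_genes(normal_loadings, disease_loadings, threshold):
    """Indices of genes that survive the removal step.

    A gene is removed only if all of its sparse loadings are zero in both
    the normal and the disease decomposition.
    """
    normal_sparse = sparsify(normal_loadings, threshold)
    disease_sparse = sparsify(disease_loadings, threshold)
    kept = []
    for i, (n_row, d_row) in enumerate(zip(normal_sparse, disease_sparse)):
        if any(v != 0.0 for v in n_row) or any(v != 0.0 for v in d_row):
            kept.append(i)
    return kept


if __name__ == "__main__":
    # Toy loadings for 3 genes x 2 components (values are made up).
    normal = [[0.8, 0.05], [0.02, 0.01], [0.03, 0.6]]
    disease = [[0.7, 0.0], [0.04, 0.03], [0.01, 0.02]]
    print(sparsify(normal, 0.1))
    print(kept_genes(normal, disease, 0.1))
```

The surviving genes would then be passed to the feature-gene extraction and linear SVM evaluation that the abstract describes; those steps are not sketched here.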
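A microarray dataset contains thousands of genes and a relatively small number of samples, and of these thousands of genes only a small fraction contribute to tumor classification. Therefore, one of the most important problems in tumor classification is to identify and select the genes that contribute most to it. To perform microarray gene selection effectively, a marginal distribution model (MDM) is proposed to describe microarray data. The model can not only distinguish whether a gene is differentially expressed between the two classes of samples, but also identify in which class of samples the gene is expressed, so that the selected genes are more biologically meaningful. Experimental results on simulated data and real microarray datasets show that the method performs microarray gene selection effectively.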
Funding (recursive random forest study): This project was supported by a grant from the National Natural Science Foundation of China (No. 30371253). Acknowledgement: We sincerely appreciate the comments from Edgar J. Love.
Funding (modified ACO marker gene study): Partially supported by the National Natural Science Foundation of China (Grant No. 60873036), the China Postdoctoral Science Foundation (Grant No. 20060400809), and the Science and Technology Special Foundation for Young Researchers of Heilongjiang Province of China (Grant No. QC06C022).
Funding (cardiomyocyte maturation study): Supported by the Maryland Stem Cell Research Fund, USA (Grant No. 2015-MSCRFF-1765), and by grants from the Ministry of Education, Science (Grant No. KAKENHI 26120528) and a Chuo University joint research grant.
Funding (adaptive logistic regression study): Supported by the National Nature Science Foundation of China (No. 61203293), the Key Scientific and Technological Project of Henan Province (No. 122102210131), the Program for Science and Technology Innovation Talents in Universities of Henan Province (No. 13HASTIT040), the Foundation of Henan Educational Committee (No. 13A120524), Henan Normal University Doctoral Topics (No. qd14156), and the Henan Higher School Funding Scheme for Young Teachers (No. 2012GGJS-063).
Funding (DSRPCL-SVM study): Supported by the National Natural Science Foundation of China (Grant No. 60471054) and the President Foundation of Peking University.
Funding (MSPCA study): Supported by the Doctoral Fund of the Chinese Ministry of Education (No. 20113514120007), the Nature Science Fund of Fujian Province in China (No. 2010J05132), and the Science and Technology Fund of the Educational Office of Fujian Province, China (No. JA10034).