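Abstract: Identifying enzyme function is important for understanding the mechanisms of life processes and for advancing the life sciences. However, existing methods for predicting enzyme EC numbers do not make full use of protein sequence information, and their recognition accuracy remains insufficient. To address these problems, this study proposes an EC number prediction network using hierarchical features and global features (ECPN-HFGF). The method first extracts general features of the protein sequence with a residual network, then further extracts hierarchical and global features through a hierarchical feature extraction module and a global feature extraction module. It then combines the predictions based on the two kinds of features within a multi-task learning framework to predict enzyme EC numbers accurately. Computational experiments show that ECPN-HFGF achieves the best performance on the protein sequence EC number prediction task, with macro-F1 and micro-F1 scores of 95.5% and 99.0%, respectively. ECPN-HFGF effectively combines the hierarchical and global features of protein sequences, predicts EC numbers quickly and accurately, attains higher prediction accuracy than commonly used current methods, and offers an efficient approach for enzymology research and enzyme engineering applications.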
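The abstract reports both a macro-F1 and a micro-F1 score. As a minimal sketch of how the two averages differ (not the authors' evaluation code), the pure-Python example below computes both for single-label predictions. The toy labels and the helper name `f1_scores` are assumptions made for illustration; the EC strings serve only as class labels.

```python
from collections import Counter


def f1_scores(y_true, y_pred):
    """Return (macro_f1, micro_f1) for single-label multi-class predictions."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class gains a false positive
            fn[t] += 1  # true class gains a false negative

    def f1(tp_, fp_, fn_):
        denom = 2 * tp_ + fp_ + fn_
        return 2 * tp_ / denom if denom else 0.0

    # Macro: average the per-class F1 values, so rare classes weigh as much as common ones.
    macro = sum(f1(tp[c], fp[c], fn[c]) for c in labels) / len(labels)
    # Micro: pool the counts over all classes before computing F1.
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    return macro, micro


# Hypothetical toy data: one frequent class and one rare class.
y_true = ["1.1.1.1"] * 8 + ["3.4.21.4"] * 2
y_pred = ["1.1.1.1"] * 8 + ["1.1.1.1", "3.4.21.4"]
macro_f1, micro_f1 = f1_scores(y_true, y_pred)
print(f"macro-F1 = {macro_f1:.3f}, micro-F1 = {micro_f1:.3f}")
```

In this toy case a single error on the rare class pulls the macro average well below the micro average, which is one common reason a macro-F1 can be lower than the micro-F1 on imbalanced label sets.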
Funding: Supported by the National Natural Science Foundation of China (Grant No. 61863010), the Key Research and Development Program of Shandong Province of China (Grant No. 2019GGX101001), and the Natural Science Foundation of Shandong Province of China (Grant No. ZR2018MC007).
Abstract: Protein-protein interactions (PPIs) are of great importance for understanding genetic mechanisms, delineating disease pathogenesis, and guiding drug design. With the growth of PPI data and the development of machine learning technologies, the prediction and identification of PPIs have become a research hotspot in proteomics. In this study, we propose a new prediction pipeline for PPIs based on gradient tree boosting (GTB). First, the initial feature vector is extracted by fusing pseudo amino acid composition (PseAAC), pseudo position-specific scoring matrix (PsePSSM), reduced sequence and index-vectors (RSIV), and autocorrelation descriptor (AD). Second, to remove redundancy and noise, we employ L1-regularized logistic regression (L1-RLR) to select an optimal feature subset. Finally, the GTB-PPI model is constructed. Five-fold cross-validation showed that GTB-PPI achieved accuracies of 95.15% and 90.47% on the Saccharomyces cerevisiae and Helicobacter pylori datasets, respectively. In addition, GTB-PPI could be applied to predict the independent test datasets for Caenorhabditis elegans, Escherichia coli, Homo sapiens, and Mus musculus, the one-core PPI network for CD9, and the crossover PPI network for the Wnt-related signaling pathways. The results show that GTB-PPI can significantly improve the accuracy of PPI prediction. The code and datasets of GTB-PPI can be downloaded from https://github.com/QUST-AIBBDRC/GTB-PPI/.
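The feature-selection step rests on the fact that an L1 penalty drives the weights of uninformative features to exactly zero. The sketch below illustrates that idea with a small proximal-gradient (ISTA) solver in pure Python. It is not the GTB-PPI implementation: the solver, the toy data, and the names `l1_logistic` and `soft_threshold` are assumptions for illustration, and in practice a library solver would normally be used.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def soft_threshold(v, t):
    """Proximal operator of t * |v|: shrink v toward zero by t."""
    if v > t:
        return v - t
    if v < -t:
        return v + t
    return 0.0


def l1_logistic(X, y, lam=0.1, lr=0.5, n_iter=2000):
    """Fit L1-penalised logistic regression by proximal gradient descent."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(n_iter):
        # Residuals p_i - y_i of the current model.
        resid = [sigmoid(sum(wj * xj for wj, xj in zip(w, row)) + b) - yi
                 for row, yi in zip(X, y)]
        grad_w = [sum(r * row[j] for r, row in zip(resid, X)) / n
                  for j in range(d)]
        grad_b = sum(resid) / n
        # Gradient step on the smooth loss, then shrinkage for the L1 term.
        w = [soft_threshold(wj - lr * gj, lr * lam)
             for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b  # the intercept is not penalised
    return w, b


# Hypothetical toy data: feature 0 separates the classes, feature 1 is noise.
X = [(-2.0, 1.0), (-1.0, -1.0), (1.0, 1.0), (2.0, -1.0)]
y = [0, 0, 1, 1]
w, b = l1_logistic(X, y)
selected = [j for j, wj in enumerate(w) if wj != 0.0]
print("weights:", w, "selected feature indices:", selected)
```

Features whose weight is exactly zero are dropped, and only the selected columns would be passed on to the downstream classifier (gradient tree boosting in the abstract).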
Funding: Sponsored by the National Key Research and Development Program of China (2021YFB3802100, 2021YFB3802105), the Major Project of Sichuan Science and Technology Department (2022ZDZX0029), and the Miaozi Project of Sichuan Science and Technology Department (2023JDRC0097).
Abstract: Biomaterials with surface nanostructures effectively enhance protein secretion and stimulate tissue regeneration. When nanoparticles (NPs) enter a living system, they quickly interact with proteins in the body fluid, forming the protein corona (PC). Accurate prediction of the PC composition is critical for analyzing the osteoinductivity of biomaterials and guiding the reverse design of NPs. However, achieving accurate predictions remains a significant challenge. Although several machine learning (ML) models such as Random Forest (RF) have been used for PC prediction, they often fail to consider the extreme values in the abundance region of PC absorption and struggle to improve accuracy because of the imbalanced data distribution. In this study, resampling embedding was introduced to resolve the imbalanced distribution of PC data. Various ML models were evaluated, an RF model was finally used for prediction, and good coefficient of determination (R^2) and root-mean-square error (RMSE) values were obtained. Our ablation experiments demonstrated that the proposed method achieved an R^2 of 0.68, an improvement of approximately 10%, and an RMSE of 0.90, a reduction of approximately 10%. Furthermore, in verification against label-free quantification of four NPs, hydroxyapatite (HA), titanium dioxide (TiO2), silicon dioxide (SiO2), and silver (Ag), we achieved a prediction performance with an R^2 value > 0.70 using Random Oversampling. Additionally, the feature analysis revealed that the composition of the PC is most significantly influenced by the incubation plasma concentration, PDI, and surface modification.
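Random oversampling for a continuous target can be sketched as follows: bin the target values, then duplicate samples from sparse bins until every bin is as large as the largest one. This is only an illustration under stated assumptions. The abstract does not describe the binning, so the bin edge, the toy abundance values, and the function name `random_oversample` are hypothetical.

```python
import random
from collections import defaultdict


def random_oversample(X, y, bin_edges, seed=0):
    """Duplicate samples from sparse target bins until all non-empty bins
    contain as many samples as the largest bin."""
    rng = random.Random(seed)
    bins = defaultdict(list)
    for i, target in enumerate(y):
        # Bin index = number of edges that the target reaches or exceeds.
        bins[sum(target >= edge for edge in bin_edges)].append(i)
    largest = max(len(indices) for indices in bins.values())
    chosen = []
    for indices in bins.values():
        chosen.extend(indices)  # keep every original sample
        # Draw the shortfall with replacement from the same bin.
        chosen.extend(rng.choices(indices, k=largest - len(indices)))
    return [X[i] for i in chosen], [y[i] for i in chosen]


# Hypothetical toy data: most abundance values are low, two are extreme.
X = [[0.1], [0.2], [0.3], [0.4], [0.5], [0.6], [0.7], [0.8]]
y = [0.1, 0.2, 0.3, 0.2, 0.1, 0.4, 2.5, 3.1]
X_res, y_res = random_oversample(X, y, bin_edges=[1.0])
print(len(y), "->", len(y_res), "samples after oversampling")
```

Resampling of this kind is normally applied to the training split only, so that duplicated samples do not leak into the data used to compute R^2 and RMSE.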
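Abstract: Objective: To investigate the expression of C-X-C chemokine receptor type 4 (CXCR4) and stromal cell-derived factor 1 (SDF1) in gastric cancer and its clinical significance. Methods: Pathological tissue specimens from 300 patients with gastric cancer were collected, with paracancerous tissue and normal gastric tissue from the same patients serving as controls. Immunohistochemical staining was used to detect CXCR4 and SDF1 protein expression, and real-time reverse transcription polymerase chain reaction (RT-PCR) was used to measure the relative expression levels of CXCR4 mRNA and SDF1 mRNA. Results: The positive expression rate of CXCR4 protein and the relative expression level of CXCR4 mRNA were both higher in gastric cancer tissue and paracancerous tissue than in normal gastric tissue, and the differences were statistically significant (P < 0.05). The positive expression rate of SDF1 protein and the relative expression level of SDF1 mRNA did not differ significantly among normal gastric tissue, paracancerous tissue, and gastric cancer tissue (P > 0.05). Among gastric cancer patients with different clinical stages and lymph node metastasis status, the positive expression rate of CXCR4 protein and the relative expression level of CXCR4 mRNA in gastric cancer tissue differed significantly (P < 0.05). Conclusion: CXCR4 protein and the relative expression level of CXCR4 mRNA are markedly elevated in gastric cancer tissue and may be related to clinical stage and lymph node metastasis in patients with gastric cancer.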
Abstract: Protein phosphorylation is an important post-translational modification that enables the activation of various enzymes and receptors involved in signaling pathways. To reduce the cost of identifying phosphorylation sites through laborious experiments, their computational prediction has been actively studied. In this study, by adopting a new set of features and applying Random Forest feature selection with grid search before training a Support Vector Machine, our method achieved better or comparable phosphorylation site prediction performance on two different data sets.
Abstract: Correct prediction of the crystallization propensity of proteins is important for saving cost and time in determining three-dimensional structures, because one can focus on crystallizing proteins predicted to have high propensity instead of choosing proteins at random. However, this goal has yet to be accomplished despite great efforts over the years, because it is still extremely hard to find an intrinsic feature of a protein that relates directly to its crystallization propensity. Despite this difficulty, efforts continue to test known features of amino acids and proteins against the crystallization propensity of proteins from various sources. In this study, the features developed by us were compared with features from a well-known resource for predicting the crystallization propensity of proteins from Bacillus halodurans. In particular, crystallization propensity was treated as a yes-no event, so 185 crystallized proteins and 270 uncrystallized proteins from B. halodurans were classified as yes-no events. Each of 540 amino-acid features, including the features developed by us, was coupled with these yes-no events using logistic regression and a neural network. The results once again demonstrated that predictions using the features developed by us are relatively better than predictions using any of the 540 amino-acid features.
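Coupling one feature at a time with a yes-no outcome can be sketched as a loop of univariate logistic regressions, each scored by how well it reproduces the labels. The example below is a minimal pure-Python illustration, not the study's procedure. The two feature names, their values, and the use of training accuracy as the score are assumptions, and a real comparison would score each feature on held-out proteins.

```python
import math


def fit_univariate_logistic(x, y, lr=0.5, n_iter=500):
    """Fit p(yes) = sigmoid(w * (x - mean) + b) by gradient descent and
    return a function that predicts 1 (yes) or 0 (no) for a feature value."""
    mean = sum(x) / len(x)
    xc = [v - mean for v in x]  # centre the feature for stable updates
    w = b = 0.0
    for _ in range(n_iter):
        resid = [1.0 / (1.0 + math.exp(-(w * v + b))) - t
                 for v, t in zip(xc, y)]
        w -= lr * sum(r * v for r, v in zip(resid, xc)) / len(x)
        b -= lr * sum(resid) / len(x)
    return lambda v: int(w * (v - mean) + b > 0)


# Hypothetical toy data: 1 = crystallized, 0 = not crystallized.
labels = [0, 0, 0, 1, 1, 1]
features = {
    "feature_A": [0.1, 0.2, 0.3, 0.7, 0.8, 0.9],  # tracks the outcome
    "feature_B": [0.5, 0.9, 0.1, 0.5, 0.1, 0.9],  # same values in both classes
}

acc = {}
for name, values in features.items():
    predict = fit_univariate_logistic(values, labels)
    hits = sum(predict(v) == t for v, t in zip(values, labels))
    acc[name] = hits / len(labels)

ranking = sorted(acc, key=acc.get, reverse=True)
print(acc, ranking)
```

Ranking the per-feature scores in this way gives a simple basis for comparing a newly developed feature against a catalogue of existing ones.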