Abstract
Breast cancer is a cancer with a high mortality rate. It arises when human breast epithelial cells, under the combined action of multiple carcinogenic factors, proliferate uncontrollably and become cancerous. Based on the provided data on ERα antagonists, this paper builds a quantitative prediction model for compound biological activity and classification prediction models for ADMET properties, in order to support the simultaneous optimization of the biological activity and ADMET properties of ERα antagonists. First, the random forest algorithm is used to rank variables by importance and screen out the 60 molecular descriptors with the highest contribution; these 60 descriptors are then passed through a high-correlation filter that decouples highly correlated variables, yielding the 20 molecular descriptors with the most significant influence on biological activity. Because the high-correlation filter ensures the independence of the descriptors retained after dimensionality reduction, the correlations among the descriptors are visualized to verify that the selection is reasonable. Second, an initial attempt to solve the problem with a multiple linear regression equation produced many outliers in the time-series residual plot, so a multiple nonlinear regression model was built instead: the variables are standardized in Python to obtain standardized indicators, the 20 molecular descriptors obtained in Problem 1 serve as independent variables, and the natural logarithm is taken of some variables with large values, giving a multiple nonlinear regression model for predicting biological activity. Finally, the top 10 molecular descriptors affecting the ADMET properties are identified and the correlations among them are visualized; the strong nonlinear mapping ability of a fully connected single-layer neural network is then used to build classification prediction models for the 5 compounds, and the cross-entropy loss plots of these models indicate that they achieve high accuracy.
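The two-stage descriptor screening described in the abstract (importance ranking followed by high-correlation filtering) could be sketched roughly as follows. This is only an illustrative sketch, not the authors' implementation: the importance scores are assumed to come from a previously fitted random forest (for example the `feature_importances_` attribute of scikit-learn's `RandomForestRegressor`), and the greedy rule, the 0.9 correlation threshold, the function names `pearson` and `filter_correlated`, and the toy data are all assumptions introduced here.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column has no defined correlation; treat as 0
    return cov / math.sqrt(var_x * var_y)


def filter_correlated(columns, importance, top_n=60, keep=20, threshold=0.9):
    """Greedy high-correlation filter (hypothetical helper).

    columns:    dict mapping descriptor name -> list of values
    importance: dict mapping descriptor name -> importance score
                (assumed to come from a random forest fitted beforehand)
    """
    # Stage 1: keep the top_n descriptors by importance.
    ranked = sorted(importance, key=importance.get, reverse=True)[:top_n]

    # Stage 2: walk down the ranking and keep a descriptor only if it is
    # not highly correlated with any descriptor already kept.
    selected = []
    for name in ranked:
        if all(abs(pearson(columns[name], columns[s])) < threshold
               for s in selected):
            selected.append(name)
        if len(selected) == keep:
            break
    return selected


# Toy data: d2 is almost a multiple of d1, so the filter should drop it.
columns = {
    "d1": [1.0, 2.0, 3.0, 4.0, 5.0],
    "d2": [2.1, 4.0, 6.2, 8.1, 9.9],
    "d3": [5.0, 1.0, 4.0, 2.0, 3.0],
    "d4": [2.0, 1.0, 4.0, 3.0, 5.0],
}
importance = {"d1": 0.40, "d2": 0.35, "d3": 0.20, "d4": 0.05}

selected = filter_correlated(columns, importance, top_n=4, keep=3, threshold=0.9)
print(selected)
```

With the paper's settings the call would use `top_n=60` and `keep=20`; the threshold is not given in the abstract and would need to be tuned so that enough weakly correlated descriptors remain.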
Source
《应用数学进展》
2023, Issue 6, pp. 3098-3111 (14 pages in total)
Advances in Applied Mathematics