Inferential models are widely used in the chemical industry to infer key process variables, which are challenging or expensive to measure, from other more easily measured variables. The aim of this paper is three-fold: to present a theoretical review of some of the well-known linear inferential modeling techniques, to enhance the predictive ability of the regularized canonical correlation analysis (RCCA) method, and finally to compare the performances of these techniques and highlight some of the practical issues that can affect their predictive abilities. The inferential modeling techniques considered in this study include full rank modeling techniques, such as ordinary least squares (OLS) regression and ridge regression (RR), and latent variable regression (LVR) techniques, such as principal component regression (PCR), partial least squares (PLS) regression, and RCCA. The theoretical analysis shows that the loading vectors used in LVR modeling can be computed by solving eigenvalue problems. For the RCCA method, we also show that optimizing the regularization parameter can improve prediction accuracy over the other modeling techniques. To illustrate the performances of all inferential modeling techniques, a comparative analysis was performed through two simulated examples, one using synthetic data and the other using simulated distillation column data. All techniques are optimized and compared by computing the cross-validation mean square error using unseen testing data. The results of this comparative analysis show that scaling the data helps improve the performances of all modeling techniques, and that the LVR techniques outperform the full rank ones. One reason for this advantage is that the LVR techniques improve the conditioning of the model by discarding the latent variables (or principal components) with small eigenvalues, which also reduces the effect of noise on the model prediction. The results also show that PCR and PLS have compara...
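The conditioning argument in this abstract can be illustrated with a minimal principal component regression sketch. This is not the paper's code: the synthetic data, the function name `pcr_fit`, and the choice of two retained components are assumptions made for illustration, and NumPy is assumed to be available. The sketch computes the loading vectors from an eigenvalue problem on the scaled data and compares the condition number before and after discarding the small-eigenvalue components.

```python
import numpy as np

def pcr_fit(X, y, n_components):
    """Hypothetical helper: regress y on the leading principal components of X."""
    # Center and scale the inputs (the abstract reports that scaling helps).
    x_mean, x_std = X.mean(axis=0), X.std(axis=0)
    y_mean = y.mean()
    Xs = (X - x_mean) / x_std

    # Loading vectors are eigenvectors of the symmetric matrix Xs^T Xs.
    eigvals, eigvecs = np.linalg.eigh(Xs.T @ Xs)
    order = np.argsort(eigvals)[::-1]          # sort eigenvalues in descending order
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]

    P = eigvecs[:, :n_components]              # retained loading vectors
    T = Xs @ P                                 # scores (latent variables)

    # Least squares on the scores only; small-eigenvalue directions are discarded.
    gamma = np.linalg.solve(T.T @ T, T.T @ (y - y_mean))
    beta = P @ gamma                           # coefficients in scaled-input space

    def predict(X_new):
        return ((X_new - x_mean) / x_std) @ beta + y_mean

    return beta, eigvals, predict

# Assumed synthetic data: four inputs driven by only two underlying factors,
# so two columns are nearly collinear copies of the other two.
rng = np.random.default_rng(0)
n = 200
t = rng.normal(size=(n, 2))
X = np.column_stack([
    t[:, 0],
    t[:, 1],
    t[:, 0] + 0.01 * rng.normal(size=n),
    t[:, 1] + 0.01 * rng.normal(size=n),
])
y = t[:, 0] + 2.0 * t[:, 1] + 0.1 * rng.normal(size=n)

k = 2
beta, eigvals, predict = pcr_fit(X, y, k)

# Condition number of the full-rank problem versus the retained subspace.
cond_full = eigvals[0] / eigvals[-1]
cond_kept = eigvals[0] / eigvals[k - 1]

# Training-set error only; the paper uses cross-validation on unseen data.
mse = float(np.mean((predict(X) - y) ** 2))
print(f"cond(full) = {cond_full:.3g}, cond(kept) = {cond_kept:.3g}, mse = {mse:.3g}")
```

In this setup `cond_kept` is smaller than `cond_full` because the two near-zero eigenvalues created by the almost duplicated columns are dropped before the regression step.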
Purpose: General linear modeling (GLM) is usually applied to investigate factors associated with the domains of Quality of Life (QOL). A summation score for a specific sub-domain is regressed on the factors associated with that sub-domain. However, using the summation score ignores the influence of individual questions. Structural equation modeling (SEM) can account for the influence of each question's score by constructing a latent variable from the questions of a sub-domain. The objective of this study is to determine whether a conventional approach such as GLM, with its use of the summation score, is valid from the standpoint of the SEM approach. Method: We used the Japanese version of the Maugeri Foundation Respiratory Failure Questionnaire, a QOL measure, on 94 patients with heart failure. The daily activity sub-domain of the questionnaire was selected together with its four accompanying factors, namely, living together, occupation, gender, and the New York Heart Association's cardiac function scale (NYHA). The level of association between each factor and the daily activity sub-domain was estimated using both SEM and GLM. The standardized partial regression coefficients of GLM and the standardized path coefficients of SEM were compared. If these coefficients were similar (absolute value of the difference [...] -0.06 and -0.07 for the GLM and SEM. Likewise, the estimates of occupation, gender, and NYHA were -0.18 and -0.20, -0.08 and -0.08, and 0.51 and 0.54, respectively. The absolute values of the difference for each factor were 0.01, 0.02, 0.00, and 0.03, respectively. All differences were less than 0.05. This means that the two approaches lead to similar conclusions.
Conclusion: GLM is a valid method for exploring factors associated with a QOL domain.
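The comparison reported in this abstract can be rechecked with a few lines of arithmetic. The sketch below is only an illustration: the 0.05 threshold is taken from the sentence "All differences were less than 0.05", and assigning the first pair (-0.06, -0.07) to the "living together" factor is an assumption based on the order in which the factors are listed, since that part of the text is missing.

```python
# Standardized coefficients as (GLM, SEM) pairs, copied from the abstract.
# Mapping the first pair to "living together" is an assumption (see lead-in).
coefficients = {
    "living together": (-0.06, -0.07),
    "occupation": (-0.18, -0.20),
    "gender": (-0.08, -0.08),
    "NYHA": (0.51, 0.54),
}

THRESHOLD = 0.05  # similarity criterion inferred from the abstract

# Absolute difference between the GLM and SEM estimates for each factor,
# rounded to the two decimals used in the abstract.
differences = {
    name: round(abs(glm - sem), 2)
    for name, (glm, sem) in coefficients.items()
}

# The two approaches are treated as agreeing when every difference is below the threshold.
all_similar = all(diff < THRESHOLD for diff in differences.values())

for name, diff in differences.items():
    print(f"{name}: |GLM - SEM| = {diff:.2f}")
print("all below threshold:", all_similar)
```

The computed differences reproduce the values 0.01, 0.02, 0.00, and 0.03 quoted in the abstract.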
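Ensemble learning has become a widely used framework for soft sensor modeling, but building high-performance ensemble soft sensor models still faces many challenges, such as improper feature selection, insufficient diversity among base models, and poor estimation performance of the base models. To address these challenges, a soft sensor modeling method is proposed: selective ensemble of stacked autoencoder based Gaussian process regression (SESAEGPR), which relies on a stacked-autoencoder diversity generation mechanism. The method exploits the strength of deep learning in feature extraction by constructing diverse stacked autoencoder (SAE) networks and building Gaussian process regression (GPR) base models on the resulting latent features. The SAEGPR base models are then pruned twice, based on the model performance improvement rate and on evolutionary multi-objective optimization, to reduce the complexity of the ensemble while maintaining or even further improving its estimation performance. Finally, a PLS stacking ensemble strategy is introduced to fuse the base models. The proposed method significantly outperforms traditional global and full-ensemble soft sensor modeling methods, and its effectiveness and superiority are verified on a penicillin fermentation process and the Tennessee Eastman chemical process.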
The number of latent variables (LVs), or the factor number, is a key parameter in PLS modeling for obtaining a correct prediction. Although much work has been done on this issue, determining a suitable LV number in practical use is still a difficult task. A method named independent factor diagnostics (IFD) is proposed for investigating the contribution of each LV to the predicted results, based on a discussion of how the LV number is determined in PLS modeling of near infrared (NIR) spectra of complex samples. The NIR spectra of three data sets of complex samples, including a public data set and two tobacco lamina data sets, are investigated. It is shown that several high-order LVs make the main contributions to the predicted results, although the contribution of the low-order LVs should not be neglected in the PLS models. Therefore, in practical uses of PLS for NIR spectral analysis of complex samples, it may be better to use a slightly larger LV number.
Funding: Supported by the National Natural Science Foundation of China (Grant Nos. 20775036 & 20835002)