Um novo método para seleção de variáveis preditivas com base em índices de importância
A new framework for predictive variable selection based on variable importance indices
Zimmer, Juliano; Anzanello, Michel José
http://dx.doi.org/10.1590/S0103-65132013005000030
Production, vol. 24, n. 1, p. 84-93, 2014
Summary
The large volume of variables collected in industrial processes makes such processes difficult to control and monitor. PLS (partial least squares) regression has been widely used in variable selection procedures because it can operate with a large number of correlated, noise-affected variables. This article proposes a method for identifying the best subset of process variables for predicting the response variables. Variable importance indices are developed from PLS regression parameters and guide the elimination of irrelevant variables. These indices are then tested in terms of their performance. When applied to five industrial datasets, the method using the recommended index retained only 31% of the original variables and increased prediction accuracy on the test set by 6%. The proposed method also surpassed the accuracy of the Stepwise method, traditionally used in selection procedures for prediction purposes.
Keywords
Variable selection. PLS regression. Variable importance index
Abstract
The large number of process variables collected in manufacturing applications makes process control and monitoring difficult. Partial Least Squares (PLS) regression has been widely used for variable selection because it can handle a large number of correlated and noisy variables. This paper proposes a method for identifying the best subset of process variables for predicting product (response) variables. To that end, variable importance indices are derived from PLS regression parameters and used to guide the elimination of noisy and irrelevant variables. Variables are systematically removed from the dataset, and the performance of the predictive model is evaluated after each removal. When applied to five manufacturing datasets, the method using the recommended index retained only 31% of the original variables and increased prediction accuracy on the test set by 6% compared with using all original variables. The proposed method also outperformed the traditional Stepwise method in prediction accuracy.
Keywords
Variable selection. PLS regression. Variable importance indices
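The selection procedure outlined in the abstract can be illustrated with a minimal sketch. The abstract does not say which importance indices the authors derive, so this example substitutes the well-known VIP (variable importance in projection) score as a stand-in. The function names (`fit_pls1`, `vip_scores`, `backward_select`), the two-component setting, the synthetic data, and the use of a held-out validation split are all assumptions made for illustration, not the authors' implementation.

```python
import math
import random


def fit_pls1(X, y, n_comp):
    """Single-response PLS via NIPALS on centered data.
    Returns one (weights, loadings, q, explained_ssy) tuple per component."""
    X, y = [r[:] for r in X], y[:]          # copies: deflation must not alter inputs
    n, m = len(X), len(X[0])
    comps = []
    for _ in range(n_comp):
        w = [sum(X[i][j] * y[i] for i in range(n)) for j in range(m)]
        norm = math.sqrt(sum(v * v for v in w))
        if norm < 1e-12:
            break
        w = [v / norm for v in w]
        t = [sum(X[i][j] * w[j] for j in range(m)) for i in range(n)]
        tt = sum(v * v for v in t)
        if tt < 1e-12:
            break
        p = [sum(X[i][j] * t[i] for i in range(n)) / tt for j in range(m)]
        q = sum(y[i] * t[i] for i in range(n)) / tt
        for i in range(n):                  # deflate X and y by the new component
            for j in range(m):
                X[i][j] -= t[i] * p[j]
            y[i] -= q * t[i]
        comps.append((w, p, q, q * q * tt))
    return comps


def predict_pls1(comps, x):
    """Predict one centered sample by replaying the deflation steps."""
    x, y_hat = x[:], 0.0
    for w, p, q, _ in comps:
        t = sum(a * b for a, b in zip(x, w))
        y_hat += q * t
        x = [a - t * b for a, b in zip(x, p)]
    return y_hat


def vip_scores(comps, m):
    """VIP score per variable (stand-in for the paper's importance indices)."""
    total = sum(c[3] for c in comps) or 1.0
    return [math.sqrt(m * sum(s * w[j] ** 2 for w, _, _, s in comps) / total)
            for j in range(m)]


def backward_select(X_tr, y_tr, X_val, y_val, n_comp=2):
    """Drop the least important variable repeatedly; keep the subset with the
    lowest validation RMSE (ties favor the smaller subset)."""
    n, m = len(X_tr), len(X_tr[0])
    mx = [sum(r[j] for r in X_tr) / n for j in range(m)]
    my = sum(y_tr) / n
    X_tr = [[r[j] - mx[j] for j in range(m)] for r in X_tr]   # center with training means
    X_val = [[r[j] - mx[j] for j in range(m)] for r in X_val]
    y_tr, y_val = [v - my for v in y_tr], [v - my for v in y_val]
    kept, best, best_rmse = list(range(m)), None, float("inf")
    while True:
        sub_tr = [[r[j] for j in kept] for r in X_tr]
        sub_val = [[r[j] for j in kept] for r in X_val]
        comps = fit_pls1(sub_tr, y_tr, min(n_comp, len(kept)))
        err = [(predict_pls1(comps, x) - v) ** 2 for x, v in zip(sub_val, y_val)]
        rmse = math.sqrt(sum(err) / len(err))
        if rmse <= best_rmse:
            best, best_rmse = kept[:], rmse
        if len(kept) == 1:
            return best, best_rmse
        vip = vip_scores(comps, len(kept))
        kept.pop(vip.index(min(vip)))       # eliminate the least important variable


def make_data(n, n_vars=6):
    """Synthetic data: only variables 0 and 1 drive the response."""
    X = [[random.gauss(0, 1) for _ in range(n_vars)] for _ in range(n)]
    y = [3 * r[0] - 2 * r[1] + random.gauss(0, 0.5) for r in X]
    return X, y


random.seed(0)
X, y = make_data(100)
subset, rmse = backward_select(X[:60], y[:60], X[60:], y[60:])
print("retained variables:", subset, "validation RMSE:", round(rmse, 3))
```

Choosing the subset on the same data used to report accuracy gives an optimistic estimate, so a separate test set would be needed to reproduce a comparison like the one reported in the abstract.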