Seleção de variáveis para clusterização de bateladas produtivas através de ACP e remapeamento kernel

Cervo, Victor Leonardo; Anzanello, Michel José

doi:10.1590/0103-6513.143613

Article

Clustering variable selection for grouping production batches through PCA and kernel mapping

Production, vol. 25, n. 4, p. 823-833, 2015. http://dx.doi.org/10.1590/0103-6513.143613

Abstract

Clustering techniques aim to form groups in which observations are homogeneous within a group and clearly distinct from the observations placed in other groups. In industrial processes that rely on batch production, defining families (groups) of batches with similar profiles supports the design of control and monitoring strategies for those processes. This paper proposes a method for selecting the most relevant clustering variables for forming batch families. To do so, it integrates kernel functions with a new variable importance index derived from the parameters of Principal Component Analysis (PCA). The quality of the resulting clusters is assessed through the Silhouette Index (SI). When applied to three production processes, the proposed approach retained on average 5.16% of the original variables and raised the mean SI by 235.4% relative to using all variables. A simulation study is also carried out to assess the robustness of the method.
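To make the evaluation criterion concrete, the following is a minimal sketch of how a Silhouette Index can be computed and used to compare a full variable set against a reduced one. It is an illustration under stated assumptions, not the authors' implementation: the batch data, the cluster labels, and the choice of which variable to keep are all made up for the example, Euclidean distance is assumed, and neither the paper's PCA-based importance index nor its kernel mapping is reproduced here.

```python
import math


def euclidean(p, q):
    """Euclidean distance between two equal-length numeric sequences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def silhouette_index(points, labels):
    """Mean silhouette width over all observations.

    For each observation, a is its mean distance to the other members of
    its own cluster and b is the smallest mean distance to the members of
    any other cluster; its width is (b - a) / max(a, b).
    """
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("the silhouette needs at least two clusters")

    widths = []
    for i, lab in enumerate(labels):
        own = clusters[lab]
        if len(own) == 1:
            widths.append(0.0)  # usual convention for singleton clusters
            continue
        a = sum(euclidean(points[i], points[j]) for j in own if j != i) / (len(own) - 1)
        b = min(
            sum(euclidean(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != lab
        )
        denom = max(a, b)
        widths.append((b - a) / denom if denom > 0 else 0.0)
    return sum(widths) / len(widths)


# Hypothetical batches: variable 0 separates the two families,
# variable 1 is noise unrelated to the grouping (invented values).
batches = [(1.0, 5.0), (1.2, 0.5), (0.9, 9.0), (5.0, 0.7), (5.1, 8.5), (4.8, 4.9)]
families = [0, 0, 0, 1, 1, 1]  # assumed labels, fixed by hand for the example

si_all = silhouette_index(batches, families)
si_selected = silhouette_index([(b[0],) for b in batches], families)

print(f"SI with all variables:     {si_all:.3f}")
print(f"SI with selected variable: {si_selected:.3f}")
```

In this toy setting the noise variable dominates the distances, so dropping it should give a noticeably higher SI. That is the kind of comparison the abstract reports when it contrasts the retained variables with the full set.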

Keywords

Clustering analysis. Variable selection. Kernel. Batch processes.

