Seleção de variáveis para clusterização de bateladas produtivas através de ACP e remapeamento kernel

Clustering variable selection for grouping production batches through PCA and kernel mapping

Cervo, Victor Leonardo; Anzanello, Michel José



Clustering techniques aim to form groups of observations that are homogeneous within each group and clearly distinct from the observations in other groups. In industrial processes that rely on batch production, defining families (groups) of batches with similar profiles supports the design of control and monitoring strategies for those processes. This paper proposes a method for selecting the most relevant clustering variables for forming batch families. To that end, it integrates kernel functions with a new variable importance index derived from the parameters of Principal Component Analysis (PCA). The quality of the resulting clusters is assessed through the Silhouette Index (SI). When applied to three production processes, the proposed approach retained on average 5.16% of the original variables and raised the average SI by 235.4% relative to using all variables. A simulation study is also carried out to assess the robustness of the method.


Clustering analysis. Variable selection. Kernel. Batch processes.
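
The evaluation loop described in the abstract can be illustrated with a minimal sketch, under several assumptions that are not taken from the paper: a Gaussian kernel with a hypothetical `gamma` parameter stands in for the kernel function, the variable ranking is passed in as a given list (the abstract does not define the PCA-based importance index, so it is not reproduced here), and the cluster labels are held fixed rather than recomputed at each step. The toy data and all function names are illustrative only.

```python
import math

def gaussian_kernel_distance(x, y, gamma=1.0):
    """Distance between x and y in the feature space of a Gaussian kernel (assumed kernel)."""
    sq = sum((a - b) ** 2 for a, b in zip(x, y))
    # ||phi(x) - phi(y)||^2 = K(x,x) + K(y,y) - 2K(x,y) = 2 - 2*exp(-gamma*sq)
    return math.sqrt(max(0.0, 2.0 - 2.0 * math.exp(-gamma * sq)))

def silhouette_index(rows, labels, dist):
    """Mean silhouette width over all observations; expects at least two clusters."""
    clusters = sorted(set(labels))
    total = 0.0
    for i, (row, lab) in enumerate(zip(rows, labels)):
        mean_dist = {}
        for c in clusters:
            others = [dist(row, rows[j]) for j in range(len(rows))
                      if labels[j] == c and j != i]
            if others:
                mean_dist[c] = sum(others) / len(others)
        if lab not in mean_dist:
            continue  # singleton cluster: silhouette taken as 0
        a = mean_dist[lab]                                      # cohesion
        b = min(v for c, v in mean_dist.items() if c != lab)    # separation
        total += (b - a) / max(a, b) if max(a, b) > 0 else 0.0
    return total / len(rows)

def select_columns(rows, keep):
    """Keep only the variables whose column indices are listed in `keep`."""
    return [[row[k] for k in keep] for row in rows]

def best_subset(rows, labels, ranking, dist):
    """Drop variables from least to most important; return the subset with the highest SI."""
    keep = list(ranking)  # most important variable first (hypothetical ranking)
    best_keep = list(keep)
    best_si = silhouette_index(select_columns(rows, keep), labels, dist)
    while len(keep) > 1:
        keep.pop()  # remove the currently least important variable
        si = silhouette_index(select_columns(rows, keep), labels, dist)
        if si > best_si:
            best_keep, best_si = list(keep), si
    return best_keep, best_si

# Toy batches: variable 0 separates the two families, variables 1 and 2 are noise.
rows = [
    [0.1, 0.9, 0.2], [0.2, 0.1, 0.8], [0.0, 0.5, 0.5],
    [1.0, 0.8, 0.1], [0.9, 0.2, 0.9], [1.1, 0.4, 0.6],
]
labels = [0, 0, 0, 1, 1, 1]
ranking = [0, 2, 1]  # assumed importance order, not computed from PCA here

si_all = silhouette_index(rows, labels, gaussian_kernel_distance)
best_keep, best_si = best_subset(rows, labels, ranking, gaussian_kernel_distance)
print("SI with all variables:", round(si_all, 3))
print("Best subset:", best_keep, "SI:", round(best_si, 3))
```

On this toy data, discarding the noise variables raises the SI, which mirrors the kind of gain the abstract reports when only a small share of the original variables is retained. In a fuller version, the ranking would come from the PCA-based index and the batches would be re-clustered for each candidate subset.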

