Probability of misclassification in model-based clustering

  • Xuwen Zhu
Short Note


Cluster analysis is an important problem in unsupervised machine learning. Model-based clustering, which relies on finite mixture models, is one of the most popular clustering techniques. Once a mixture model has been fitted, a question naturally arises as to how many misclassifications the resulting partition contains. At the same time, rather limited literature is devoted to developing diagnostic tools for the obtained clustering solution. In this paper, an algorithm is developed for efficiently estimating the misclassification probability. The confusion probability map and classification confidence region are proposed for predicting the confusion matrix, identifying which cluster causes the most confusion, and understanding the distribution of misclassifications. Applications to real-life datasets illustrate the developed technique with promising results.
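To make the quantities named above concrete, the following is a minimal sketch and not the algorithm developed in the paper. It assumes a univariate Gaussian mixture whose parameters are already known, draws points from each component, assigns every draw to the component with the largest weighted density, and tabulates how often draws from one component land in another. The function names and the two-component parameter values are hypothetical and chosen only for illustration.

```python
import math
import random


def normal_pdf(x, mean, sd):
    """Density of a univariate normal distribution at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def map_label(x, weights, means, sds):
    """Index of the component with the largest weighted density at x."""
    scores = [w * normal_pdf(x, m, s) for w, m, s in zip(weights, means, sds)]
    return max(range(len(scores)), key=scores.__getitem__)


def confusion_probabilities(weights, means, sds, n_draws=20000, seed=0):
    """Monte Carlo estimate of P(assigned to j | drawn from i) for all i, j."""
    rng = random.Random(seed)
    k = len(weights)
    counts = [[0] * k for _ in range(k)]
    for true in range(k):
        for _ in range(n_draws):
            x = rng.gauss(means[true], sds[true])
            counts[true][map_label(x, weights, means, sds)] += 1
    return [[c / n_draws for c in row] for row in counts]


def misclassification_probability(weights, conf):
    """Overall error: mixing-weighted sum of the off-diagonal row mass."""
    return sum(w * (1.0 - conf[j][j]) for j, w in enumerate(weights))


# Hypothetical two-component mixture (illustrative values, not from the paper).
weights, means, sds = [0.5, 0.5], [0.0, 3.0], [1.0, 1.0]
conf = confusion_probabilities(weights, means, sds)
overall = misclassification_probability(weights, conf)
print(conf)
print(overall)
```

Each row of `conf` gives, for one true component, the estimated probability of assignment to every component, so the off-diagonal entries show which pairs of clusters are confused most often. In practice the mixture parameters would be replaced by estimates from a fitted model, which adds estimation uncertainty that this sketch ignores.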


Finite mixture models, Classification confidence region, Diagnostics, Misclassification



The research is partially funded by the University of Louisville EVPRI internal research grant from the Office of the Executive Vice President for Research and Innovation.



Copyright information

© Springer-Verlag GmbH Germany, part of Springer Nature 2019

Authors and Affiliations

  1. Department of Mathematics, The University of Louisville, Louisville, USA
