Issues of robustness and high dimensionality in cluster analysis

  • Conference paper
Compstat 2006 - Proceedings in Computational Statistics

Abstract

Finite mixture models are being increasingly used to model the distributions of a wide variety of random phenomena. While normal mixture models are often used to cluster data sets of continuous multivariate data, a more robust clustering can be obtained by considering the t mixture model-based approach. Mixtures of factor analyzers enable model-based density estimation to be undertaken for high-dimensional data where the number of observations n is not very large relative to their dimension p. As the approach using the multivariate normal family of distributions is sensitive to outliers, it is more robust to adopt the multivariate t family for the component error and factor distributions. The computational aspects associated with robustness and high dimensionality in these approaches to cluster analysis are discussed and illustrated.
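
The two ideas in the abstract can be illustrated with a minimal Python sketch. It is not the authors' code, and the function names `t_weights` and `covariance_params` are hypothetical. It assumes the standard EM treatment of a multivariate t component, in which each observation receives a weight (nu + p) / (nu + d), where d is its squared Mahalanobis distance from the component mean, and the usual factor-analytic covariance form Sigma = B B^T + D with q factors and diagonal D.

```python
import numpy as np

def t_weights(X, mu, Sigma, nu):
    """Weights u_j = (nu + p) / (nu + d_j) for one multivariate t component.

    d_j is the squared Mahalanobis distance of row x_j from mu.
    Observations far from mu get small weights, which is the source
    of the robustness to outliers.
    """
    X = np.asarray(X, dtype=float)
    p = X.shape[1]
    diff = X - mu
    d = np.einsum("ij,ij->i", diff @ np.linalg.inv(Sigma), diff)
    return (nu + p) / (nu + d)

def covariance_params(p, q=None):
    """Free parameters in one component covariance matrix.

    q=None: unrestricted p x p covariance, p(p+1)/2 parameters.
    q given: factor-analytic form B B^T + D, pq + p - q(q-1)/2 parameters.
    """
    if q is None:
        return p * (p + 1) // 2
    return p * q + p - q * (q - 1) // 2

# Toy data (assumed for illustration): the last row is a gross outlier.
X = np.array([[0.0, 0.0], [1.0, -1.0], [10.0, 10.0]])
w = t_weights(X, mu=np.zeros(2), Sigma=np.eye(2), nu=4.0)
print(w)                         # the outlier's weight is close to zero
print(covariance_params(50))     # unrestricted covariance, p = 50
print(covariance_params(50, 3))  # factor-analytic covariance, p = 50, q = 3
```

Under a normal component every observation would effectively carry weight 1, so the outlier would pull the estimated mean and covariance towards it. The parameter counts show why the factor-analytic restriction helps when n is not large relative to p.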

© 2006 Physica-Verlag Heidelberg

About this paper

Cite this paper

Basford, K., McLachlan, G., Bean, R. (2006). Issues of robustness and high dimensionality in cluster analysis. In: Rizzi, A., Vichi, M. (eds) Compstat 2006 - Proceedings in Computational Statistics. Physica-Verlag HD. https://doi.org/10.1007/978-3-7908-1709-6_1
