Issues of robustness and high dimensionality in cluster analysis

Basford, Kaye; McLachlan, Geoff; Bean, Richard

doi:10.1007/978-3-7908-1709-6_1

Abstract

Finite mixture models are increasingly used to model the distributions of a wide variety of random phenomena. While normal mixture models are often used to cluster continuous multivariate data, a more robust clustering can be obtained with the t mixture model-based approach. Mixtures of factor analyzers enable model-based density estimation to be undertaken for high-dimensional data where the number of observations n is not very large relative to their dimension p. Because the approach based on the multivariate normal family of distributions is sensitive to outliers, it is more robust to adopt the multivariate t family for the component error and factor distributions. The computational aspects associated with robustness and high dimensionality in these approaches to cluster analysis are discussed and illustrated.
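
The two ideas in the abstract can be illustrated with a small sketch. It is not taken from the chapter: it assumes the standard EM treatment of the multivariate t distribution, in which an observation with squared Mahalanobis distance delta from a component receives weight (nu + p) / (nu + delta), and the usual factor-analytic covariance structure Sigma = B B^T + D with a p x q loading matrix B and diagonal D. All function names, the diagonal covariance, and the numerical values are hypothetical choices made for illustration.

```python
def t_weight(delta, nu, p):
    """Weight (nu + p) / (nu + delta) given to an observation whose squared
    Mahalanobis distance from the component mean is delta (assumed form)."""
    return (nu + p) / (nu + delta)


def sq_mahalanobis_diag(y, mu, variances):
    """Squared Mahalanobis distance for a diagonal covariance matrix."""
    return sum((yi - mi) ** 2 / v for yi, mi, v in zip(y, mu, variances))


def full_cov_params(p):
    """Free parameters in an unrestricted p x p covariance matrix."""
    return p * (p + 1) // 2


def factor_cov_params(p, q):
    """Free parameters in Sigma = B B^T + D with q factors (q < p)."""
    return p * q + p - q * (q - 1) // 2


# Hypothetical component: zero mean, unit variances, nu = 4 degrees of freedom.
mu, variances, nu = [0.0, 0.0], [1.0, 1.0], 4.0
typical, outlier = [0.5, -0.5], [8.0, 8.0]

w_typical = t_weight(sq_mahalanobis_diag(typical, mu, variances), nu, p=2)
w_outlier = t_weight(sq_mahalanobis_diag(outlier, mu, variances), nu, p=2)

print(f"weight for typical point: {w_typical:.3f}")
print(f"weight for outlier:       {w_outlier:.3f}")
print("covariance parameters, p=10:",
      full_cov_params(10), "unrestricted vs", factor_cov_params(10, 2), "with q=2")
```

The outlying point receives a much smaller weight than the typical one, which is the sense in which a t component is less affected by atypical observations than a normal component (where every point counts equally). The parameter counts show why a factor-analytic covariance is attractive when n is not very large relative to p.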

Author information

Authors and Affiliations

School of Land and Food Sciences, University of Queensland, Australia
Kaye Basford
Department of Mathematics & Institute for Molecular Bioscience, University of Queensland, Australia
Geoff McLachlan
Institute for Molecular Bioscience, University of Queensland, Australia
Richard Bean

Editor information

Editors and Affiliations

University of Rome “La Sapienza”, Piazzale Aldo Moro 5, 00185, Rome, Italy
Alfredo Rizzi & Maurizio Vichi

About this paper

Cite this paper

Basford, K., McLachlan, G., Bean, R. (2006). Issues of robustness and high dimensionality in cluster analysis. In: Rizzi, A., Vichi, M. (eds) Compstat 2006 - Proceedings in Computational Statistics. Physica-Verlag HD. https://doi.org/10.1007/978-3-7908-1709-6_1

DOI: https://doi.org/10.1007/978-3-7908-1709-6_1
Publisher Name: Physica-Verlag HD
Print ISBN: 978-3-7908-1708-9
Online ISBN: 978-3-7908-1709-6
eBook Packages: Mathematics and Statistics, Mathematics and Statistics (R0)
