Abstract
Finite mixture models are being increasingly used to model the distributions of a wide variety of random phenomena. While normal mixture models are often used to cluster data sets of continuous multivariate data, a more robust clustering can be obtained by considering the t mixture model-based approach. Mixtures of factor analyzers enable model-based density estimation to be undertaken for high-dimensional data where the number of observations n is not very large relative to their dimension p. As the approach using the multivariate normal family of distributions is sensitive to outliers, it is more robust to adopt the multivariate t family for the component error and factor distributions. The computational aspects associated with robustness and high dimensionality in these approaches to cluster analysis are discussed and illustrated.
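The robustness claim can be illustrated with a minimal sketch. It is not taken from the chapter: it assumes a single univariate t component with the scale and degrees of freedom held fixed, and the names t_weights and t_location are hypothetical. Under the t model, each observation receives a weight of the form (nu + p) / (nu + squared distance), so a point far from the current location estimate contributes very little to the update, whereas the ordinary sample mean weights every point equally.

```python
def t_weights(x, mu, sigma2, nu):
    """Weights u_j = (nu + p) / (nu + d_j^2) for a univariate t model (p = 1).

    d_j^2 is the squared standardized distance of x_j from mu.
    """
    p = 1
    return [(nu + p) / (nu + (xj - mu) ** 2 / sigma2) for xj in x]


def t_location(x, sigma2=1.0, nu=4.0, n_iter=50):
    """Iteratively reweighted location estimate; scale and nu are fixed (assumption)."""
    mu = sorted(x)[len(x) // 2]  # start from a middle order statistic
    for _ in range(n_iter):
        u = t_weights(x, mu, sigma2, nu)
        # Weighted mean: points with small weights barely move the estimate.
        mu = sum(uj * xj for uj, xj in zip(u, x)) / sum(u)
    return mu


# Illustrative data only: six points near zero and one gross outlier.
data = [-0.5, -0.2, 0.0, 0.1, 0.3, 0.4, 25.0]
sample_mean = sum(data) / len(data)       # pulled towards the outlier
robust_mu = t_location(data)              # stays near the bulk of the data
weights = t_weights(data, robust_mu, 1.0, 4.0)
```

In this toy setting the outlier's weight is close to zero, so the t-based location stays near the six inlying points while the sample mean is dragged upwards. A full t mixture fit would also update the scale, the mixing proportions and possibly the degrees of freedom, which this sketch omits.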
Copyright information
© 2006 Physica-Verlag Heidelberg
About this paper
Cite this paper
Basford, K., McLachlan, G., Bean, R. (2006). Issues of robustness and high dimensionality in cluster analysis. In: Rizzi, A., Vichi, M. (eds) Compstat 2006 - Proceedings in Computational Statistics. Physica-Verlag HD. https://doi.org/10.1007/978-3-7908-1709-6_1
DOI: https://doi.org/10.1007/978-3-7908-1709-6_1
Publisher Name: Physica-Verlag HD
Print ISBN: 978-3-7908-1708-9
Online ISBN: 978-3-7908-1709-6
eBook Packages: Mathematics and Statistics (R0)