Cluster Analysis of High-Dimensional Data: A Case Study

Bean, Richard; McLachlan, Geoff

doi:10.1007/11508069_40

Conference paper

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 3578)

Abstract

Normal mixture models are often used to cluster continuous data. However, conventional approaches to fitting these models have difficulty producing nonsingular estimates of the component-covariance matrices when the dimension of the observations is large relative to the number of observations. In this case, methods such as principal components analysis (PCA) and the mixture of factor analyzers model can be adopted to avoid these estimation problems. We apply these approaches to the Cabernet wine data set of Ashenfelter (1999), clustering both the wines and the judges, and compare our results with another analysis. The mixture of factor analyzers model proves particularly effective in clustering the wines, accurately classifying many of them by location.
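
To make the estimation problem concrete, the sketch below counts the free covariance parameters per mixture component. It assumes the usual factor-analytic form Sigma = B B^T + D (B a p x q loading matrix, D diagonal), which the abstract does not spell out. The function names and the values p = 40, q = 2 are hypothetical and are not taken from the paper.

```python
def full_covariance_params(p: int) -> int:
    """Free parameters in one unrestricted (symmetric) p x p covariance matrix."""
    return p * (p + 1) // 2


def factor_covariance_params(p: int, q: int) -> int:
    """Free parameters in B B^T + D, with B of size p x q and D diagonal.

    The q(q-1)/2 term accounts for the rotational indeterminacy of B.
    """
    return p * q + p - q * (q - 1) // 2


# Hypothetical sizes chosen only for illustration (not from the paper).
p, q = 40, 2
full = full_covariance_params(p)        # unrestricted component covariance
fa = factor_covariance_params(p, q)     # factor-analytic component covariance

print(f"unrestricted: {full} parameters per component")
print(f"factor model: {fa} parameters per component")
print(f"reduction:    {full - fa}")
```

With few observations per component, the unrestricted count quickly exceeds what the data can support, whereas the factor-analytic count grows only linearly in p for fixed q.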

References

Ashenfelter, O.: California Versus All Challengers: The 1999 Cabernet Challenge (1999), http://www.liquidasset.com/report20.html
Banfield, J.D., Raftery, A.E.: Model-based Gaussian and non-Gaussian clustering. Biometrics 49, 803–821 (1993)
Chang, W.C.: On using principal components before separating a mixture of two multivariate normal distributions. Applied Statistics 32, 267–275 (1983)
Dempster, A.P., Laird, N.M., Rubin, D.B.: Maximum likelihood from incomplete data via the EM algorithm (with discussion). Journal of the Royal Statistical Society B 39, 1–38 (1977)
Healy, M.J.R.: Matrices for Statisticians. Clarendon, Oxford (1986)
Liu, J., Feng, J., Young, S.S.: PowerMV v0.61 (2005), http://www.niss.org/PowerMV/
Liu, L., Hawkins, D.M., Ghosh, S., Young, S.S.: Robust singular value decomposition analysis of microarray data. Proceedings of the National Academy of Sciences USA 100, 13167–13172 (2003)
McLachlan, G.J.: On bootstrapping the likelihood ratio test statistic for the number of components in a normal mixture. Applied Statistics 36, 318–324 (1987)
McLachlan, G.J., Krishnan, T.: The EM Algorithm and Extensions. Wiley, New York (1997)
McLachlan, G.J., Peel, D.: Mixtures of factor analyzers. In: Langley, P. (ed.) Proceedings of the Seventeenth International Conference on Machine Learning. Morgan Kaufmann, San Francisco (2000a)
McLachlan, G.J., Peel, D.: Finite Mixture Models. Wiley, New York (2000b)
McLachlan, G.J., Peel, D., Bean, R.W.: Modelling high-dimensional data by mixtures of factor analyzers. Comput. Statist. Data Anal. 41, 379–388 (2003)
Troyanskaya, O., Cantor, M., Sherlock, G., Brown, P., Hastie, T., Tibshirani, R., Botstein, D., Altman, R.B.: Missing value estimation methods for DNA microarrays. Bioinformatics 17, 520–525 (2001)
Young, S.: Private communication (2005)

Author information

Authors and Affiliations

ARC Centre in Bioinformatics, Institute for Molecular Bioscience, UQ,
Richard Bean & Geoff McLachlan
Department of Mathematics, University of Queensland, UQ,
Geoff McLachlan

Editor information

Editors and Affiliations

School of Information Technology and Electrical Engineering, University of Queensland, 4072, Australia
Marcus Gallagher
, POB 30031, FL 32503-1031, Pensacola
James P. Hogan
Faculty of Information Technology, Queensland University of Technology, Box 2434, Q 4001, Brisbane, Australia
Frederic Maire

About this paper

Cite this paper

Bean, R., McLachlan, G. (2005). Cluster Analysis of High-Dimensional Data: A Case Study. In: Gallagher, M., Hogan, J.P., Maire, F. (eds) Intelligent Data Engineering and Automated Learning - IDEAL 2005. IDEAL 2005. Lecture Notes in Computer Science, vol 3578. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11508069_40

DOI: https://doi.org/10.1007/11508069_40
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-26972-4
Online ISBN: 978-3-540-31693-0
eBook Packages: Computer Science, Computer Science (R0)
