
Cluster Analysis of High-Dimensional Data: A Case Study

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 3578)

Abstract

Normal mixture models are often used to cluster continuous data. However, conventional approaches to fitting these models have difficulty producing nonsingular estimates of the component-covariance matrices when the dimension of the observations is large relative to the number of observations. In this case, methods such as principal components analysis (PCA) and the mixture of factor analyzers model can be adopted to avoid these estimation problems. We examine these approaches applied to the Cabernet wine data set of Ashenfelter (1999), considering the clustering of both the wines and the judges, and comparing our results with another analysis. The mixture of factor analyzers model proves particularly effective in clustering the wines, accurately classifying many of the wines by location.
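The estimation problem described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' code and does not use the wine data: the sizes (10 observations, 50 dimensions, 2 factors), the simulated Gaussian data, and all function names are assumptions chosen for illustration. The factor-analytic covariance form B Bᵀ + D, with D diagonal, is the standard one from the mixtures of factor analyzers literature rather than something stated on this page.

```python
import numpy as np


def sample_covariance_rank(n_obs, dim, seed=0):
    """Rank of the sample covariance of n_obs simulated draws from N(0, I_dim)."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((n_obs, dim))
    cov = np.cov(x, rowvar=False)  # dim x dim matrix
    return int(np.linalg.matrix_rank(cov))


def factor_covariance(loadings, uniquenesses):
    """Factor-analytic covariance B B^T + D, where D is diagonal and positive."""
    return loadings @ loadings.T + np.diag(uniquenesses)


def covariance_param_counts(dim, n_factors):
    """Free parameters: unrestricted covariance vs. factor-analytic form."""
    full = dim * (dim + 1) // 2
    factor = dim * n_factors + dim - n_factors * (n_factors - 1) // 2
    return full, factor


def pca_scores(x, n_components):
    """Project centred data onto its leading principal components via SVD."""
    centred = x - x.mean(axis=0)
    _, _, vt = np.linalg.svd(centred, full_matrices=False)
    return centred @ vt[:n_components].T


if __name__ == "__main__":
    n_obs, dim, n_factors = 10, 50, 2  # hypothetical sizes, not the wine data

    # With fewer observations than dimensions the sample covariance is singular.
    print("rank of sample covariance:", sample_covariance_rank(n_obs, dim), "of", dim)

    # A factor-analytic covariance stays positive definite by construction.
    rng = np.random.default_rng(1)
    sigma = factor_covariance(rng.standard_normal((dim, n_factors)), np.ones(dim))
    print("smallest eigenvalue of B B^T + D:", np.linalg.eigvalsh(sigma).min())

    # It also needs far fewer parameters than an unrestricted covariance.
    print("parameter counts (full, factor):", covariance_param_counts(dim, n_factors))

    # Alternatively, reduce the dimension with PCA before fitting a mixture.
    scores = pca_scores(rng.standard_normal((n_obs, dim)), 3)
    print("PCA scores shape:", scores.shape)
```

Both remedies named in the abstract appear here in simplified form: PCA shrinks the dimension before any mixture is fitted, while the factor-analytic structure restricts each component-covariance matrix so that it remains nonsingular. Fitting the mixture itself, for example by the EM algorithm, is outside the scope of this sketch.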


References

  • Ashenfelter, O.: California Versus All Challengers: The 1999 Cabernet Challenge (1999), http://www.liquidasset.com/report20.html

  • Banfield, J.D., Raftery, A.E.: Model-based Gaussian and non-Gaussian clustering. Biometrics 49, 803–821 (1993)

  • Chang, W.C.: On using principal components before separating a mixture of two multivariate normal distributions. Applied Statistics 32, 267–275 (1983)

  • Dempster, A.P., Laird, N.M., Rubin, D.B.: Maximum likelihood from incomplete data via the EM algorithm (with discussion). Journal of the Royal Statistical Society B 39, 1–38 (1977)

  • Healy, M.J.R.: Matrices for Statistics. Clarendon, Oxford (1986)

  • Liu, J., Feng, J., Young, S.S.: PowerMV v0.61 (2005), http://www.niss.org/PowerMV/

  • Liu, L., Hawkins, D.M., Ghosh, S., Young, S.S.: Robust singular value decomposition analysis of microarray data. Proceedings of the National Academy of Sciences USA 100, 13167–13172 (2003)

  • McLachlan, G.J.: On bootstrapping the likelihood ratio test statistic for the number of components in a normal mixture. Applied Statistics 36, 318–324 (1987)

  • McLachlan, G.J., Krishnan, T.: The EM Algorithm and Extensions. Wiley, New York (1997)

  • McLachlan, G.J., Peel, D.: Mixtures of factor analyzers. In: Langley, P. (ed.) Proceedings of the Seventeenth International Conference on Machine Learning. Morgan Kaufmann, San Francisco (2000a)

  • McLachlan, G.J., Peel, D.: Finite Mixture Models. Wiley, New York (2000b)

  • McLachlan, G.J., Peel, D., Bean, R.W.: Modelling high-dimensional data by mixtures of factor analyzers. Comput. Statist. Data Anal. 41, 379–388 (2003)

  • Troyanskaya, O., Cantor, M., Sherlock, G., Brown, P., Hastie, T., Tibshirani, R., Botstein, D., Altman, R.B.: Missing value estimation methods for DNA microarrays. Bioinformatics 17, 520–525 (2001)

  • Young, S.: Private communication (2005)


Copyright information

© 2005 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Bean, R., McLachlan, G. (2005). Cluster Analysis of High-Dimensional Data: A Case Study. In: Gallagher, M., Hogan, J.P., Maire, F. (eds) Intelligent Data Engineering and Automated Learning - IDEAL 2005. IDEAL 2005. Lecture Notes in Computer Science, vol 3578. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11508069_40

  • DOI: https://doi.org/10.1007/11508069_40

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-26972-4

  • Online ISBN: 978-3-540-31693-0

  • eBook Packages: Computer Science, Computer Science (R0)
