Clustering of High-Dimensional and Correlated Data

  • Geoffrey J. McLachlan
  • Shu-Kay Ng
  • K. Wang
Conference paper
Part of the Studies in Classification, Data Analysis, and Knowledge Organization book series (STUDIES CLASS)


Finite mixture models are commonly used in a wide range of practical applications involving density estimation and clustering. An attractive feature of this approach to clustering is that it provides a sound statistical framework in which to assess the important questions of how many clusters there are in the data and how valid they are. We consider the application of normal mixture models to high-dimensional data of a continuous nature. One way to handle the fitting of normal mixture models to such data is to adopt mixtures of factor analyzers. However, for extremely high-dimensional data, some variable-reduction method needs to be used in conjunction with the latter model, as in the procedure called EMMIX-GENE. It was developed for the clustering of microarray data in bioinformatics, but is applicable to other types of data. We also consider the mixture procedure EMMIX-WIRE (based on mixtures of normal components with random effects), which is suitable for clustering high-dimensional data that may be structured (correlated and replicated), as in longitudinal studies.
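
To illustrate why mixtures of factor analyzers help in high dimensions, the sketch below compares the number of free covariance parameters per component under two models. It assumes the usual formulation of the factor-analytic model, in which each p x p component covariance is written as B B^T + D, with B a p x q loading matrix and D diagonal. The abstract does not spell out this formula, and the function names are hypothetical, chosen only for this example.

```python
def full_covariance_params(p: int) -> int:
    """Free parameters in one unrestricted p x p covariance matrix."""
    return p * (p + 1) // 2


def factor_analytic_params(p: int, q: int) -> int:
    """Free parameters in B B^T + D, with B (p x q) and D diagonal (p x p).

    Assumed standard count: p*q loadings plus p diagonal variances,
    minus q(q-1)/2 for the rotational indeterminacy of B.
    """
    if not 0 < q < p:
        raise ValueError("expected 0 < q < p")
    return p * q + p - q * (q - 1) // 2


if __name__ == "__main__":
    # Compare the two counts for a few illustrative (p, q) pairs.
    for p, q in [(10, 2), (100, 4), (1000, 5)]:
        print(p, q, full_covariance_params(p), factor_analytic_params(p, q))
```

With p = 1000 variables and q = 5 factors, an unrestricted component covariance has 500500 free parameters, while the factor-analytic form has 5990. The count is still linear in p, which is consistent with the abstract's point that extremely high-dimensional data call for an additional variable-reduction step.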


Keywords: Mixture Model, Component Density, Factor Analyzer Model, Finite Mixture Model, Normal Mixture Model



Copyright information

© Springer-Verlag Berlin Heidelberg 2010

Authors and Affiliations

  1. Department of Mathematics and Institute for Molecular Bioscience, University of Queensland, Brisbane, Australia
