High-dimensional variable selection with the plaid mixture model for clustering

  • Original Paper
Computational Statistics

Abstract

In high-dimensional data, the number of covariates is considerably larger than the sample size. We propose a method for analyzing such data that performs clustering and variable selection simultaneously. The method is inspired by the plaid model and may be seen as a multiplicative mixture model that allows for overlapping clustering. Unlike conventional clustering, under this model an observation may be explained by several clusters, which makes it especially suitable for gene expression data. Parameter estimation is performed with the Monte Carlo expectation maximization algorithm and importance sampling. Using extensive simulations and comparisons with competing methods, we show the advantages of our methodology in terms of both variable selection and clustering. An application of our approach to gene expression data for kidney renal cell carcinoma from The Cancer Genome Atlas validates some previously identified cancer biomarkers.

Notes

  1. IPA (Ingenuity® Systems, www.ingenuity.com) is software for interactive pathway analysis of complex ’omic data.

Acknowledgements

The authors are grateful to LeeAnn Chastain at MD Anderson Cancer Center for editing assistance.

Author information

Corresponding author

Correspondence to Alejandro Murua.

Additional information

This work was supported in part by the Natural Sciences and Engineering Research Council of Canada (NSERC) through Grant Number 327689-06.

Electronic supplementary material

Supplementary Materials

The accompanying supplementary document presents a more detailed description of the similarity between our model and the multiplicative mixture model (Section A); further details on the EM updating equations and the Monte Carlo error (Section B); the simulation setup (Section C); the effective number of parameters, including a comparison between AIC and BIC results (Section D); and a clustering sensitivity study on the choice of the number of nearest neighbors used to impute the missing data in the TCGA kidney cancer application (Section E). ESM 1 (pdf 247kb)

About this article

Cite this article

Chekouo, T., Murua, A. High-dimensional variable selection with the plaid mixture model for clustering. Comput Stat 33, 1475–1496 (2018). https://doi.org/10.1007/s00180-018-0818-7

  • DOI: https://doi.org/10.1007/s00180-018-0818-7
