Computational Statistics, Volume 33, Issue 3, pp 1475–1496

High-dimensional variable selection with the plaid mixture model for clustering

  • Thierry Chekouo
  • Alejandro Murua
Original Paper

Abstract

In high-dimensional data, the number of covariates is considerably larger than the sample size. We propose a sound method for analyzing such data that performs clustering and variable selection simultaneously. The method is inspired by the plaid model and may be seen as a multiplicative mixture model that allows for overlapping clustering. Unlike conventional clustering, this model allows an observation to be explained by several clusters, a characteristic that makes it especially suitable for gene expression data. Parameter estimation is performed with the Monte Carlo expectation maximization algorithm and importance sampling. Using extensive simulations and comparisons with competing methods, we show the advantages of our methodology in terms of both variable selection and clustering. An application of our approach to the gene expression data of kidney renal cell carcinoma taken from The Cancer Genome Atlas validates some previously identified cancer biomarkers.
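
As a rough illustration of overlapping clustering combined with variable selection, the sketch below builds the mean of each observation as the sum of the effects of every cluster it belongs to, in the spirit of the additive layers of the plaid model, and lets only a subset of variables carry cluster signal. It is not the authors' model or estimation procedure: the function name `plaid_means`, the toy memberships, the effect values, and the noise level are all assumptions chosen for illustration.

```python
import random


def plaid_means(memberships, effects, informative, n_vars):
    """Return the mean matrix of a toy plaid-style model (illustrative only).

    memberships: list of sets; memberships[i] holds the clusters containing
                 observation i (empty set = background, several = overlap).
    effects:     effects[k][j] is the effect of cluster k on variable j.
    informative: set of variable indices assumed to discriminate clusters.
    """
    means = []
    for clusters in memberships:
        row = []
        for j in range(n_vars):
            if j in informative:
                # Overlap: add up the effects of every cluster of this observation.
                row.append(sum(effects[k][j] for k in clusters))
            else:
                # Non-selected variables carry no cluster signal.
                row.append(0.0)
        means.append(row)
    return means


# Hypothetical toy setup: 4 observations, 2 clusters, 4 variables.
memberships = [{0}, {0, 1}, {1}, set()]   # observation 1 lies in both clusters
effects = [[1.0, -1.0, 0.0, 0.0],         # cluster 0
           [2.0, 0.5, 0.0, 0.0]]          # cluster 1
informative = {0, 1}                      # only the first two variables matter

means = plaid_means(memberships, effects, informative, n_vars=4)

# Add Gaussian noise to obtain simulated observations (noise level assumed).
rng = random.Random(0)
noisy = [[m + rng.gauss(0.0, 0.1) for m in row] for row in means]
```

In this toy setup the second observation belongs to both clusters, so its mean on each informative variable is the sum of the two cluster effects, while the last two variables stay at zero for every observation.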

Keywords

Classification; Model selection; Multiplicative mixture model; Monte Carlo EM; Kidney cancer genomic data

Notes

Acknowledgements

The authors are grateful to LeeAnn Chastain at MD Anderson Cancer Center for editing assistance.

Supplementary material

180_2018_818_MOESM1_ESM.pdf (246 kb)
The accompanying supplementary document presents: a more detailed description of the similarity of our model with the multiplicative mixture model (Section A); further details on the EM updating equations and the Monte Carlo error (Section B); the simulation setup (Section C); the effective number of parameters, including a comparison between AIC and BIC results (Section D); and a clustering sensitivity study on the choice of the number of nearest neighbors used to impute the missing data in the TCGA kidney cancer application (Section E).

Copyright information

© Springer-Verlag GmbH Germany, part of Springer Nature 2018

Authors and Affiliations

  1. Department of Mathematics and Statistics, University of Minnesota Duluth, Duluth, USA
  2. Département de mathématiques et de statistique, Université de Montréal, Montréal, Canada
