Abstract
With high-dimensional data, the number of covariates is considerably larger than the sample size. We propose a sound method for analyzing these data. It simultaneously performs clustering and variable selection. The method is inspired by the plaid model. It may be seen as a multiplicative mixture model that allows for overlapping clustering. Unlike in conventional clustering, within this model an observation may be explained by several clusters. This characteristic makes it especially suitable for gene expression data. Parameter estimation is performed with the Monte Carlo expectation maximization algorithm and importance sampling. Using extensive simulations and comparisons with competing methods, we show the advantages of our methodology in terms of both variable selection and clustering. An application of our approach to the gene expression data of kidney renal cell carcinoma taken from The Cancer Genome Atlas validates some previously identified cancer biomarkers.
Notes
IPA (Ingenuity® Systems, www.ingenuity.com) is software for interactive pathway analysis of complex ’omic data.
Acknowledgements
The authors are grateful to LeeAnn Chastain at MD Anderson Cancer Center for editing assistance.
Additional information
This work was supported in part by the Natural Sciences and Engineering Research Council of Canada (NSERC) through Grant Number 327689-06.
Supplementary Materials
The accompanying supplementary document presents a more detailed description of the similarity of our model to the multiplicative mixture model (Section A); further details on the EM updating equations and the Monte Carlo error (Section B); the simulation setup (Section C); the effective number of parameters, including a comparison between AIC and BIC results (Section D); and a clustering sensitivity study on the choice of the number of nearest neighbors used to impute the missing data in the TCGA kidney cancer application (Section E).
Cite this article
Chekouo, T., Murua, A. High-dimensional variable selection with the plaid mixture model for clustering. Comput Stat 33, 1475–1496 (2018). https://doi.org/10.1007/s00180-018-0818-7
DOI: https://doi.org/10.1007/s00180-018-0818-7