High-dimensional variable selection with the plaid mixture model for clustering

Chekouo, Thierry; Murua, Alejandro

doi:10.1007/s00180-018-0818-7

Original Paper
Published: 17 May 2018

Volume 33, pages 1475–1496, (2018)

Computational Statistics

Abstract

In high-dimensional data, the number of covariates is considerably larger than the sample size. We propose a method for analyzing such data that performs clustering and variable selection simultaneously. The method is inspired by the plaid model and may be seen as a multiplicative mixture model that allows for overlapping clustering. Unlike conventional clustering, this model allows an observation to be explained by several clusters, a characteristic that makes it especially suitable for gene expression data. Parameter estimation is performed with the Monte Carlo expectation-maximization algorithm and importance sampling. Using extensive simulations and comparisons with competing methods, we show the advantages of our methodology in terms of both variable selection and clustering. An application of our approach to gene expression data from kidney renal cell carcinoma, taken from The Cancer Genome Atlas, validates some previously identified cancer biomarkers.
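To make the idea of overlapping clusters with cluster-specific variables concrete, the following is a minimal sketch that simulates data from a simplified, additive plaid-style layout in the spirit of the original plaid model. It is not the authors' multiplicative mixture model or their estimation procedure. The function name `simulate_plaid`, the membership indicators `rho` and `kappa`, the cluster effects `mu`, and all probabilities and ranges are assumptions chosen only for illustration.

```python
import random


def simulate_plaid(n_obs, n_vars, n_clusters, seed=0, noise_sd=0.1):
    """Simulate data from a simplified plaid-style layout (illustrative only).

    Each entry is the sum of the effects of every cluster that contains both
    the observation and the variable, plus Gaussian noise.
    """
    rng = random.Random(seed)
    # rho[i][k] = 1 if observation i belongs to cluster k.
    # Rows may contain several ones, so clusters can overlap.
    rho = [[1 if rng.random() < 0.4 else 0 for _ in range(n_clusters)]
           for _ in range(n_obs)]
    # kappa[j][k] = 1 if variable j is relevant for cluster k.
    # A variable with an all-zero row carries no cluster signal.
    kappa = [[1 if rng.random() < 0.3 else 0 for _ in range(n_clusters)]
             for _ in range(n_vars)]
    # mu[k] is the (hypothetical) mean shift contributed by cluster k.
    mu = [rng.uniform(1.0, 3.0) for _ in range(n_clusters)]

    data = []
    for i in range(n_obs):
        row = []
        for j in range(n_vars):
            signal = sum(mu[k] * rho[i][k] * kappa[j][k]
                         for k in range(n_clusters))
            row.append(signal + rng.gauss(0.0, noise_sd))
        data.append(row)
    return data, rho, kappa, mu


if __name__ == "__main__":
    data, rho, kappa, mu = simulate_plaid(n_obs=8, n_vars=20, n_clusters=3)
    overlapping = [i for i, r in enumerate(rho) if sum(r) > 1]
    irrelevant = [j for j, r in enumerate(kappa) if sum(r) == 0]
    print("observations in more than one cluster:", overlapping)
    print("variables with no cluster signal:", irrelevant)
```

In this toy layout, variable selection amounts to recovering which rows of `kappa` are non-zero, and overlapping clustering amounts to recovering rows of `rho` that may contain more than one non-zero entry.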

Notes

IPA (Ingenuity® Systems, www.ingenuity.com) is software for interactive pathway analysis of complex ’omic data.

Acknowledgements

The authors are grateful to LeeAnn Chastain at MD Anderson Cancer Center for editing assistance.

Author information

Authors and Affiliations

Department of Mathematics and Statistics, University of Minnesota Duluth, 1117 University Drive, Duluth, MN, 55812, USA
Thierry Chekouo
Département de mathématiques et de statistique, Université de Montréal, CP 6128, succ. centre-ville, Montréal, QC, H3C 3J7, Canada
Alejandro Murua

Corresponding author

Correspondence to Alejandro Murua.

Additional information

This work was supported in part by the Natural Sciences and Engineering Research Council of Canada (NSERC) through Grant Number 327689-06.

Electronic supplementary material

Supplementary Materials

The accompanying supplementary document presents a more detailed description of the similarity between our model and the multiplicative mixture model (Section A); further details on the EM updating equations and the Monte Carlo error (Section B); the simulation setup (Section C); the effective number of parameters, including a comparison between AIC and BIC results (Section D); and a clustering sensitivity study on the choice of the number of nearest neighbors used to impute the missing data in the TCGA kidney cancer application (Section E). ESM 1 (PDF, 247 kB)

About this article

Cite this article

Chekouo, T., Murua, A. High-dimensional variable selection with the plaid mixture model for clustering. Comput Stat 33, 1475–1496 (2018). https://doi.org/10.1007/s00180-018-0818-7

Received: 16 June 2017
Accepted: 10 May 2018
Published: 17 May 2018
Issue Date: September 2018
DOI: https://doi.org/10.1007/s00180-018-0818-7
