Variable Selection for Mixed Data Clustering: Application in Human Population Genomics

  • Matthieu Marbac
  • Mohammed Sedki
  • Etienne Patin


Model-based clustering of human population genomic data, composed of 1,318 individuals originating from western Central Africa and 160,470 markers, is considered. This challenging analysis leads us to develop a new methodology for variable selection in clustering. To explain the differences between subpopulations and to increase the accuracy of the estimates, variable selection is performed simultaneously with clustering. We propose two approaches for selecting variables when clustering is performed with the latent class model (i.e., a mixture model assuming independence within components). The first method simultaneously performs model selection and parameter inference. It optimizes the Bayesian Information Criterion with a modified version of the standard expectation–maximization algorithm. The second method performs model selection without requiring parameter inference by maximizing the Maximum Integrated Complete-data Likelihood criterion. Although the application considers categorical data, the proposed methods are introduced in the general context of mixed data (data composed of different types of features). As a first step, the value of both proposed methods is shown on simulated data and on several real benchmark data sets. Then, we apply the clustering method to the human population genomic data, which allows the most discriminative genetic markers to be detected. The proposed method is implemented in the R package VarSelLCM, which is available on CRAN.
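To make the role of the information criterion more concrete, the following is a minimal sketch, not the authors' implementation and not the VarSelLCM interface, of how a BIC-type score could be computed for a latent class model on categorical data when some variables are flagged as irrelevant for clustering. It assumes the common convention BIC = log-likelihood − (ν/2) log n, to be maximized, and that an irrelevant variable shares one set of level probabilities across all components. The function names (`lcm_loglik`, `n_free_params`, `bic`) and the toy data are hypothetical.

```python
import math

def lcm_loglik(data, weights, probs):
    """Observed-data log-likelihood of a latent class model.

    data: list of rows, each a list of category indices (one per variable).
    weights: mixing proportions, one per component.
    probs[k][j][h]: probability that variable j takes level h in component k.
    """
    total = 0.0
    for row in data:
        mix = 0.0
        for pi_k, comp in zip(weights, probs):
            # Within a component the variables are independent,
            # so the component density is a product over variables.
            dens = pi_k
            for j, level in enumerate(row):
                dens *= comp[j][level]
            mix += dens
        total += math.log(mix)
    return total

def n_free_params(n_components, n_levels, relevant):
    """Free parameters: (K - 1) proportions, plus (levels - 1) per variable,
    counted K times only for variables flagged as relevant (assumption)."""
    nu = n_components - 1
    for m_j, is_rel in zip(n_levels, relevant):
        nu += (m_j - 1) * (n_components if is_rel else 1)
    return nu

def bic(loglik, nu, n):
    """BIC in the 'larger is better' convention (assumed here)."""
    return loglik - 0.5 * nu * math.log(n)

# Toy data: two binary variables, four individuals (hypothetical).
data = [[0, 0], [0, 1], [1, 0], [1, 1]]
weights = [0.5, 0.5]
# Variable 0 differs across components; variable 1 has identical
# probabilities in both components, i.e. it carries no cluster information.
probs = [
    [[0.9, 0.1], [0.5, 0.5]],
    [[0.1, 0.9], [0.5, 0.5]],
]

ll = lcm_loglik(data, weights, probs)
nu_selected = n_free_params(2, [2, 2], [True, False])
nu_full = n_free_params(2, [2, 2], [True, True])
bic_selected = bic(ll, nu_selected, len(data))
bic_full = bic(ll, nu_full, len(data))
```

In this toy setting both models reach the same log-likelihood, so the model that declares the second variable irrelevant has fewer free parameters and therefore a higher BIC. This is the kind of trade-off a variable selection procedure based on an information criterion exploits.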


Human evolutionary genetics; Information criterion; Mixed data; Model-based clustering; Variable selection




Copyright information

© The Classification Society 2019

Authors and Affiliations

  1. CREST, Ensai, Bruz, France
  2. UMR Inserm-1181, University of Paris-Sud, Orsay, France
  3. CNRS URA3012, Institut Pasteur, Paris, France
