Abstract
Count data are increasingly ubiquitous across a wide range of applications, with datasets growing in both size and dimension. In this context, a growing body of work is dedicated to statistical models that directly account for the discrete nature of the data. Moreover, integrating dimension reduction into clustering has been shown to drastically improve performance and stability. In this paper, we rely on the mixture of multinomial PCA, a mixture model for the clustering of count data also known in the literature as the probabilistic clustering-projection model. Related to the latent Dirichlet allocation model, it offers the flexibility of topic modeling while assigning each observation to a unique cluster. We introduce a greedy clustering algorithm in which inference and clustering are performed jointly, combining a classification variational expectation maximization algorithm with a branch-and-bound-like strategy on a variational lower bound. An integrated classification likelihood criterion is derived for model selection, and a thorough study with numerical experiments is proposed to assess both the performance and the robustness of the method. Finally, we illustrate the qualitative interest of the method in a real-world application, the clustering of anatomopathological medical reports, in partnership with expert practitioners from the Institut Curie hospital.
Notes
Available on CRAN.
In-situ cancers are pre-invasive lesions, so named because they have not yet started to spread. Invasive cancer tissue can contain both invasive and in-situ lesions on the same slide.
References
Aggarwal CC, Zhai C (2012) A survey of text clustering algorithms. Mining text data. Springer, New York, pp 77–128
Akaike H (1998) Information theory and an extension of the maximum likelihood principle. Selected papers of hirotugu akaike. Springer, New York, pp 199–213
Anders S, Huber W (2010) Differential expression analysis for sequence count data. Genome Biol 11(10):R106
Banfield JD, Raftery AE (1993) Model-based Gaussian and non-Gaussian clustering. Biometrics 803–821
Bergé LR, Bouveyron C, Corneli M, Latouche P (2019) The latent topic block model for the co-clustering of textual interaction data. Comput Stat Data Anal
Biernacki C, Celeux G, Govaert G (2000) Assessing a mixture model for clustering with the integrated completed likelihood. IEEE Trans Pattern Anal Mach Intell 22(7):719–725
Blei DM, Kucukelbir A, McAuliffe JD (2017) Variational inference: a review for statisticians. J Am Stat Assoc 112(518):859–877
Blei DM, Ng AY, Jordan MI (2003) Latent Dirichlet allocation. J Mach Learn Res 3(Jan):993–1022
Bouveyron C, Celeux G, Murphy TB, Raftery AE (2019) Model-based clustering and classification for data science: with applications in R. Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press, Cambridge
Bouveyron C, Girard S, Schmid C (2007) High-dimensional data clustering. Comput Stat Data Anal 52(1):502–519
Bouveyron C, Latouche P, Zreik R (2018) The stochastic topic block model for the clustering of vertices in networks with textual edges. Stat Comput 28(1):11–31
Bui QV, Sayadi K, Amor SB, Bui M (2017) Combining latent Dirichlet allocation and k-means for documents clustering: effect of probabilistic based distance measures. In: Asian conference on intelligent information and database systems. Springer, New York, pp 248–257
Buntine W (2002) Variational extensions to EM and multinomial PCA. In: European conference on machine learning. Springer, New York, pp 23–34
Buntine WL, Perttu S (2003) Is multinomial PCA multi-faceted clustering or dimensionality reduction? In: AISTATS
Carel L, Alquier P (2017) Simultaneous dimension reduction and clustering via the NMF-EM algorithm. arXiv preprint arXiv:1709.03346
Celeux G, Govaert G (1992) A classification EM algorithm for clustering and two stochastic versions. Comput Stat Data Anal 14(3):315–332
Chien J-T, Lee C-H, Tan Z-H (2017) Latent Dirichlet mixture model. Neurocomputing
Chiquet J, Mariadassou M, Robin S et al (2018) Variational inference for probabilistic Poisson PCA. Ann Appl Stat 12(4):2674–2698
Cunningham RB, Lindenmayer DB (2005) Modeling count data of rare species: some statistical issues. Ecology 86(5):1135–1142
Daudin J-J, Picard F, Robin S (2008) A mixture model for random graphs. Stat Comput 18(2):173–183
Defossez G, Le Guyader-Peyrou S, Uhry Z, Grosclaude P, Remontet L, Colonna M, Dantony E, Delafosse P, Molinié F, Woronoff A-S, et al (2019) Estimations nationales de l’incidence et de la mortalité par cancer en france métropolitaine entre 1990 et 2018. Résultats préliminaires. Saint-Maurice (Fra): Santé publique France
Dempster AP, Laird NM, Rubin DB (1977) Maximum likelihood from incomplete data via the EM algorithm. J R Stat Soc: Ser B (Methodol) 39(1):1–22
Ding C, Li T, Peng W (2008) On the equivalence between non-negative matrix factorization and probabilistic latent semantic indexing. Comput Stat Data Anal 52(8):3913–3927
Eckart C, Young G (1936) The approximation of one matrix by another of lower rank. Psychometrika 1(3):211–218
Ellis IO, Elston CW (2006) Histologic grade. Breast pathology. Elsevier, Amsterdam, pp 225–233
Fordyce JA, Gompert Z, Forister ML, Nice CC (2011) A hierarchical Bayesian approach to ecological count data: a flexible tool for ecologists. PLoS ONE 6(11):e26785
Hartigan JA (1975) Clustering algorithms. Wiley, Hoboken
Hoffman M, Bach FR, Blei DM (2010) Online learning for latent Dirichlet allocation. Adv Neural Inf Process Syst 856–864
Hofmann T (1999) Probabilistic latent semantic analysis. In: Proceedings of the fifteenth conference on Uncertainty in artificial intelligence. Morgan Kaufmann Publishers Inc, pp 289–296
Hornik K, Grün B (2011) topicmodels: an R package for fitting topic models. J Stat Softw 40(13):1–30
Hotelling H (1933) Analysis of a complex of statistical variables into principal components. J Educ Psychol 24(6):417
Lakhani SR (2012) WHO classification of tumours of the breast. International Agency for Research on Cancer
Lazebnik S, Schmid C, Ponce J (2006) Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories. In: 2006 IEEE computer society conference on computer vision and pattern recognition (CVPR’06), volume 2. IEEE, pp 2169–2178
Lee DD, Seung HS (1999) Learning the parts of objects by non-negative matrix factorization. Nature 401(6755):788
Lee DD, Seung HS (2001) Algorithms for non-negative matrix factorization. Adv Neural Inf Process Syst 556–562
Liu L, Tang L, Dong W, Yao S, Zhou W (2016) An overview of topic modeling and its current applications in bioinformatics. SpringerPlus 5(1):1608
Mattei P-A, Bouveyron C, Latouche P (2016) Globally sparse probabilistic PCA. Artif Intell Stat 976–984
McLachlan G, Peel D (2000) Finite mixture models. Wiley Series in Probability and Statistics
Nelder JA, Wedderburn RW (1972) Generalized linear models. J R Stat Soc: Ser A (Gen) 135(3):370–384
Osborne J (2005) Notes on the use of data transformations. Pract Assess Res Eval 9(1):42–50
O'Hara RB, Kotze DJ (2010) Do not log-transform count data. Methods Ecol Evol 1(2):118–122
R Core Team (2019) R: a language and environment for statistical computing organization. R Foundation for Statistical Computing, Vienna, Austria. https://www.R-project.org/
Ramos J et al (2003) Using TF-IDF to determine word relevance in document queries. In: Proceedings of the first instructional conference on machine learning, volume 242, Piscataway, pp 133–142
Rand WM (1971) Objective criteria for the evaluation of clustering methods. J Am Stat Assoc 66(336):846–850
Rau A, Celeux G, Martin-Magniette M-L, Maugis-Rabusseau C (2011) Clustering high-throughput sequencing data with Poisson mixture models. Research Report RR-7786, INRIA
Rigouste L, Cappé O, Yvon F (2007) Inference and evaluation of the multinomial mixture model for text clustering. Inf Process Manag 43(5):1260–1280
Schwarz G et al (1978) Estimating the dimension of a model. Ann Stat 6(2):461–464
Silvestre C, Cardoso MG, Figueiredo MA (2014) Identifying the number of clusters in discrete mixture models. arXiv preprint arXiv:1409.7419
Sorlie T, Tibshirani R, Parker J, Hastie T, Marron J, Nobel A, Deng S, Johnsen H, Pesich R, Geisler S et al (2003) Repeated observation of breast tumor subtypes in independent gene expression data sets. Proc Nat Acad Sci USA 100(14):8418–8423
St-Pierre AP, Shikon V, Schneider DC (2018) Count data in biology-data transformation or model reformation? Ecol Evol 8(6):3077–3085
Tipping ME, Bishop CM (1999a) Mixtures of probabilistic principal component analyzers. Neural Comput 11(2):443–482
Tipping ME, Bishop CM (1999b) Probabilistic principal component analysis. J R Stat Soc: Ser B (Stat Methodol) 61(3):611–622
Wallach HM (2008) Structured topic models for language. PhD thesis, University of Cambridge
Watanabe K, Akaho S, Omachi S, Okada M (2010) Simultaneous clustering and dimensionality reduction using variational bayesian mixture model. Classification as a tool for research. Springer, New York, pp 81–89
Xie P, Xing EP (2013) Integrating document clustering and topic modeling. In: Proceedings of the 30th conference on uncertainty in artificial intelligence
Xu W, Liu X, Gong Y (2003) Document clustering based on non-negative matrix factorization. In: Proceedings of the 26th annual international ACM SIGIR conference on research and development in information retrieval. ACM, pp 267–273
Yu S, Yu K, Tresp V, Kriegel H-P (2005) A probabilistic clustering-projection model for discrete data. European conference on principles of data mining and knowledge discovery. Springer, New York, pp 417–428
Zwiener I, Frisch B, Binder H (2014) Transforming RNA-seq data to improve the performance of prognostic gene signatures. PLoS ONE 9(1):e85150
Acknowledgements
This work was supported by a DIM Math Innov grant from Région Ile-de-France. This work has also been supported by the French government through the 3IA Côte d’Azur Investments in the Future project managed by the National Research Agency (ANR) with the reference number ANR-19-P3IA-0002. We are thankful for the support from fédération F2PM, CNRS FR 2036, Paris. Finally, we would like to thank the anonymous reviewers for their helpful comments, which helped to improve the paper.
Proofs
1.1 Constructing the meta-observations
Proof of Proposition 1
since \(x_{iv} = \sum _n w_{inv}\). Then, put
and this completes the proof of Proposition 1. \(\square \)
1.2 Derivation of the lower bound
Lower bound and Proposition 2
The bound of Eq. (14) follows from the standard derivation of the evidence lower bound in variational inference. Since \(\log \) is concave, Jensen's inequality gives:
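In generic notation (a standard sketch of the evidence lower bound; the paper's Eq. (14) is the instance of this inequality for the present model), Jensen's inequality applied to the marginal over \((Z, \theta )\) reads:

```latex
\log p(X \mid \pi, \beta)
  = \log \mathbb{E}_{\mathcal{R}}\!\left[ \frac{p(X, Z, \theta \mid \pi, \beta)}{\mathcal{R}(Z, \theta)} \right]
  \;\ge\; \mathbb{E}_{\mathcal{R}}\!\left[ \log p(X, Z, \theta \mid \pi, \beta) \right]
        - \mathbb{E}_{\mathcal{R}}\!\left[ \log \mathcal{R}(Z, \theta) \right]
  =: \mathcal{L}(\mathcal{R};\, \pi, \beta).
```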
Moreover, the difference between the classification log-likelihood and its bound is exactly the KL divergence between the approximate posterior \({{\,\mathrm{\mathcal {R}}\,}}(\cdot )\) and the true one:
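Writing \(\mathcal{L}\) for the bound, this gap is the standard identity (generic notation, not the paper's exact display):

```latex
\log p(X \mid \pi, \beta) - \mathcal{L}(\mathcal{R};\, \pi, \beta)
  = \mathrm{KL}\!\left( \mathcal{R}(Z, \theta) \,\Vert\, p(Z, \theta \mid X, \pi, \beta) \right) \;\ge\; 0,
```

so maximizing the bound over \(\mathcal{R}\) amounts to minimizing this divergence.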
Furthermore, the complete expression is given in Proposition 2 as:
where
\(\square \)
1.3 Optimization of \({{\,\mathrm{\mathcal {R}}\,}}(Z)\)
Proof of Proposition 3
A classical result in mean-field inference (see Blei et al. 2017) states that, at the optimum, with all other distributions held fixed:
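In generic mean-field notation, this classical optimality condition reads (with \(\mathcal{R}^\star \) the optimal factor; a sketch, not the paper's exact display):

```latex
\log \mathcal{R}^{\star}(z_{in})
  = \mathbb{E}_{\mathcal{R}(Z^{\setminus in},\, \theta)}\!\left[ \log p(X, Z, \theta \mid \pi, \beta) \right] + \text{const}.
```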
where the expectation is taken with respect to all \(Z\) except \(z_{in}\), and to all \(\theta \), assuming \((Z, \theta ) \sim {{\,\mathrm{\mathcal {R}}\,}}\). Developing the latter leads to:
Equation (18) characterizes the log density of a multinomial:
where the quantity inside brackets represents the logarithm of the parameter, modulo the normalizing constant. Hence,
\(\square \)
1.4 Optimization of \({{\,\mathrm{\mathcal {R}}\,}}(\theta )\)
Proof of Proposition 4
With the same reasoning, the optimal form of \({{\,\mathrm{\mathcal {R}}\,}}(\theta )\) is:
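Generically, the mean-field optimality condition for this factor is (a sketch in the same generic notation as above):

```latex
\log \mathcal{R}^{\star}(\theta)
  = \mathbb{E}_{\mathcal{R}(Z)}\!\left[ \log p(X, Z, \theta \mid \pi, \beta) \right] + \text{const}.
```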
Once again, a specific functional form appears as the log of a product of Q independent Dirichlet densities. Then,
with the Dirichlet parameters inside the brackets of Eq. (19):
\(\square \)
1.5 Optimization of \(\beta \)
Proof of Proposition 5 (I)
This is a constrained maximization problem with K constraints \(\sum _{v=1}^{V} \beta _{vk} = 1\). Isolating the terms of Eq. (17) that depend on \(\beta \), and denoting the constraint multipliers by \((\lambda _k)_k\), the Lagrangian can be written:
Setting its derivative to 0 yields:
\(\square \)
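To illustrate the argument with a generic objective: the coefficients \(c_{vk}\) below are placeholders for whatever expected counts multiply \(\log \beta _{vk}\) in Eq. (17), not the paper's exact terms. The Lagrange condition then gives the usual normalized closed form:

```latex
\mathcal{L}(\beta, \lambda)
  = \sum_{k=1}^{K} \sum_{v=1}^{V} c_{vk} \log \beta_{vk}
  + \sum_{k=1}^{K} \lambda_k \Big( 1 - \sum_{v=1}^{V} \beta_{vk} \Big),
\qquad
\frac{\partial \mathcal{L}}{\partial \beta_{vk}}
  = \frac{c_{vk}}{\beta_{vk}} - \lambda_k = 0
\;\Longrightarrow\;
\beta_{vk} = \frac{c_{vk}}{\sum_{v'=1}^{V} c_{v'k}}.
```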
1.6 Optimization of \(\pi \)
Proof of Proposition 5 (II)
The bound depends on \(\pi \) only through its clustering term:
Once again, this is a constrained optimization problem; introducing the Lagrange multiplier \(\lambda \) associated with the constraint \(\textstyle \sum _{q=1}^{Q} \pi _q = 1\), we get:
Setting the derivative with respect to \(\pi _q\) to 0, we get:
\(\square \)
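The same Lagrange argument yields the familiar proportion update, \(\pi _q\) proportional to the total posterior responsibility of cluster \(q\). A minimal numerical sketch of this closed form, with hypothetical responsibilities (all names and values are illustrative, not the paper's data):

```python
import numpy as np

# Hypothetical soft cluster assignments r[i, q] for N = 4 observations
# and Q = 2 clusters (placeholder values for illustration only).
r = np.array([[0.9, 0.1],
              [0.8, 0.2],
              [0.3, 0.7],
              [0.2, 0.8]])
N = r.shape[0]

# Closed-form constrained optimum of sum_{i,q} r[i,q] * log(pi[q])
# under sum_q pi[q] = 1, from the Lagrange-multiplier argument:
# pi[q] = (1/N) * sum_i r[i, q].
pi = r.sum(axis=0) / N
print(pi)  # [0.55 0.45]

# Sanity check: a fine grid search over the 2-cluster simplex
# recovers (approximately) the same maximizer.
grid = np.linspace(1e-3, 1 - 1e-3, 999)
scores = [np.sum(r * np.log(np.array([p, 1.0 - p]))) for p in grid]
best = grid[int(np.argmax(scores))]
```

The grid search agrees with the closed form up to the grid resolution, confirming that the normalized responsibilities are the constrained maximizer.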
1.7 Model selection
Proof of Proposition 6
Assuming that the parameters \((\pi , \beta )\) follow a prior distribution that factorizes as follows:
where
Then, the classification log-likelihood is written:
The first term in Eq. (22) is exact by Dirichlet-multinomial conjugacy. Setting \(\eta =\frac{1}{2}\) and applying a Stirling approximation to the Gamma function, as in Daudin et al. (2008), leads to:
As for the second term, a BIC-like approximation as in Bouveyron et al. (2018) gives:
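Generically, such a BIC-like approximation has the form below. This is a hedged sketch: \(\nu \) denotes the number of free parameters in \(\beta \) (here \(K(V-1)\) for \(K\) columns on the \(V\)-simplex) and \(n\) the relevant sample size; the paper's exact penalty may differ.

```latex
\log p(X \mid Y, Q, K)
  \;\approx\; \max_{\beta} \log p(X \mid Y, \beta, Q, K)
  \;-\; \frac{\nu}{2} \log n,
\qquad \nu = K (V - 1).
```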
In practice, \( \log {{\,\mathrm{p}\,}}(X\mid Y, \beta , Q, K) \) remains intractable, so we replace it by its variational approximation after convergence of the VEM, \({\mathcal {J}}^\star _{\text {LDA}}\), the sum of the per-meta-observation LDA bounds detailed in Eq. (17) (distinct from \({\mathcal {L}}\)). In the end, this gives the following criterion:
Note that:
i.e., the bound obtained once Algorithm 1 has converged. \(\square \)
Cite this article
Jouvin, N., Latouche, P., Bouveyron, C. et al. Greedy clustering of count data through a mixture of multinomial PCA. Comput Stat 36, 1–33 (2021). https://doi.org/10.1007/s00180-020-01008-9