Greedy clustering of count data through a mixture of multinomial PCA


Count data are becoming increasingly ubiquitous across a wide range of applications, with datasets growing in both size and dimension. In this context, a growing body of work is dedicated to statistical models that directly account for the discrete nature of the data. Moreover, integrating dimension reduction into clustering has been shown to drastically improve performance and stability. In this paper, we rely on the mixture of multinomial PCA, a mixture model for the clustering of count data also known in the literature as the probabilistic clustering-projection model. Related to the latent Dirichlet allocation model, it offers the flexibility of topic modeling while assigning each observation to a unique cluster. We introduce a greedy clustering algorithm in which inference and clustering are performed jointly, combining a classification variational expectation maximization algorithm with a branch-and-bound-like strategy on a variational lower bound. An integrated classification likelihood criterion is derived for model selection, and a thorough numerical study assesses both the performance and the robustness of the method. Finally, we illustrate the qualitative interest of the method in a real-world application, the clustering of anatomopathological medical reports, in partnership with expert practitioners from the Institut Curie hospital.


Fig. 11


  1.

  2. Available on CRAN.

  3. In-situ cancers are pre-invasive lesions that take their name from the fact that they have not yet begun to spread. Invasive cancer tissue can contain both invasive and in-situ lesions in the same slide.


  1. Aggarwal CC, Zhai C (2012) A survey of text clustering algorithms. In: Mining text data. Springer, New York, pp 77–128

  2. Akaike H (1998) Information theory and an extension of the maximum likelihood principle. In: Selected papers of Hirotugu Akaike. Springer, New York, pp 199–213

  3. Anders S, Huber W (2010) Differential expression analysis for sequence count data. Genome Biol 11(10):R106

  4. Banfield JD, Raftery AE (1993) Model-based Gaussian and non-Gaussian clustering. Biometrics, pp 803–821

  5. Bergé LR, Bouveyron C, Corneli M, Latouche P (2019) The latent topic block model for the co-clustering of textual interaction data. Comput Stat Data Anal

  6. Biernacki C, Celeux G, Govaert G (2000) Assessing a mixture model for clustering with the integrated completed likelihood. IEEE Trans Pattern Anal Mach Intell 22(7):719–725

  7. Blei DM, Kucukelbir A, McAuliffe JD (2017) Variational inference: a review for statisticians. J Am Stat Assoc 112(518):859–877

  8. Blei DM, Ng AY, Jordan MI (2003) Latent Dirichlet allocation. J Mach Learn Res 3:993–1022

  9. Bouveyron C, Celeux G, Murphy TB, Raftery AE (2019) Model-based clustering and classification for data science: with applications in R. Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press, Cambridge

  10. Bouveyron C, Girard S, Schmid C (2007) High-dimensional data clustering. Comput Stat Data Anal 52(1):502–519

  11. Bouveyron C, Latouche P, Zreik R (2018) The stochastic topic block model for the clustering of vertices in networks with textual edges. Stat Comput 28(1):11–31

  12. Bui QV, Sayadi K, Amor SB, Bui M (2017) Combining latent Dirichlet allocation and k-means for document clustering: effect of probabilistic-based distance measures. In: Asian conference on intelligent information and database systems. Springer, New York, pp 248–257

  13. Buntine W (2002) Variational extensions to EM and multinomial PCA. In: European conference on machine learning. Springer, New York, pp 23–34

  14. Buntine WL, Perttu S (2003) Is multinomial PCA multi-faceted clustering or dimensionality reduction? In: AISTATS

  15. Carel L, Alquier P (2017) Simultaneous dimension reduction and clustering via the NMF-EM algorithm. arXiv preprint arXiv:1709.03346

  16. Celeux G, Govaert G (1992) A classification EM algorithm for clustering and two stochastic versions. Comput Stat Data Anal 14(3):315–332

  17. Chien J-T, Lee C-H, Tan Z-H (2017) Latent Dirichlet mixture model. Neurocomputing

  18. Chiquet J, Mariadassou M, Robin S et al (2018) Variational inference for probabilistic Poisson PCA. Ann Appl Stat 12(4):2674–2698

  19. Cunningham RB, Lindenmayer DB (2005) Modeling count data of rare species: some statistical issues. Ecology 86(5):1135–1142

  20. Daudin J-J, Picard F, Robin S (2008) A mixture model for random graphs. Stat Comput 18(2):173–183

  21. Defossez G, Le Guyader-Peyrou S, Uhry Z, Grosclaude P, Remontet L, Colonna M, Dantony E, Delafosse P, Molinié F, Woronoff A-S et al (2019) Estimations nationales de l'incidence et de la mortalité par cancer en France métropolitaine entre 1990 et 2018. Résultats préliminaires. Saint-Maurice (Fra): Santé publique France

  22. Dempster AP, Laird NM, Rubin DB (1977) Maximum likelihood from incomplete data via the EM algorithm. J R Stat Soc Ser B (Methodol) 39(1):1–22

  23. Ding C, Li T, Peng W (2008) On the equivalence between non-negative matrix factorization and probabilistic latent semantic indexing. Comput Stat Data Anal 52(8):3913–3927

  24. Eckart C, Young G (1936) The approximation of one matrix by another of lower rank. Psychometrika 1(3):211–218

  25. Ellis IO, Elston CW (2006) Histologic grade. In: Breast pathology. Elsevier, Amsterdam, pp 225–233

  26. Fordyce JA, Gompert Z, Forister ML, Nice CC (2011) A hierarchical Bayesian approach to ecological count data: a flexible tool for ecologists. PLoS ONE 6(11):e26785

  27. Hartigan JA (1975) Clustering algorithms. Wiley, Hoboken

  28. Hoffman M, Bach FR, Blei DM (2010) Online learning for latent Dirichlet allocation. Adv Neural Inf Process Syst, pp 856–864

  29. Hofmann T (1999) Probabilistic latent semantic analysis. In: Proceedings of the fifteenth conference on uncertainty in artificial intelligence. Morgan Kaufmann, pp 289–296

  30. Hornik K, Grün B (2011) topicmodels: an R package for fitting topic models. J Stat Softw 40(13):1–30

  31. Hotelling H (1933) Analysis of a complex of statistical variables into principal components. J Educ Psychol 24(6):417

  32. Lakhani SR (2012) WHO classification of tumours of the breast. International Agency for Research on Cancer

  33. Lazebnik S, Schmid C, Ponce J (2006) Beyond bags of features: spatial pyramid matching for recognizing natural scene categories. In: 2006 IEEE computer society conference on computer vision and pattern recognition (CVPR'06), vol 2. IEEE, pp 2169–2178

  34. Lee DD, Seung HS (1999) Learning the parts of objects by non-negative matrix factorization. Nature 401(6755):788

  35. Lee DD, Seung HS (2001) Algorithms for non-negative matrix factorization. Adv Neural Inf Process Syst, pp 556–562

  36. Liu L, Tang L, Dong W, Yao S, Zhou W (2016) An overview of topic modeling and its current applications in bioinformatics. SpringerPlus 5(1):1608

  37. Mattei P-A, Bouveyron C, Latouche P (2016) Globally sparse probabilistic PCA. In: Artificial intelligence and statistics, pp 976–984

  38. McLachlan G, Peel D (2000) Finite mixture models. Wiley Series in Probability and Statistics. Wiley, Hoboken

  39. Nelder JA, Wedderburn RW (1972) Generalized linear models. J R Stat Soc Ser A (Gen) 135(3):370–384

  40. Osborne J (2005) Notes on the use of data transformations. Pract Assess Res Eval 9(1):42–50

  41. O'Hara RB, Kotze DJ (2010) Do not log-transform count data. Methods Ecol Evol 1(2):118–122

  42. R Core Team (2019) R: a language and environment for statistical computing. R Foundation for Statistical Computing, Vienna

  43. Ramos J et al (2003) Using tf-idf to determine word relevance in document queries. In: Proceedings of the first instructional conference on machine learning, vol 242, Piscataway, pp 133–142

  44. Rand WM (1971) Objective criteria for the evaluation of clustering methods. J Am Stat Assoc 66(336):846–850

  45. Rau A, Celeux G, Martin-Magniette M-L, Maugis-Rabusseau C (2011) Clustering high-throughput sequencing data with Poisson mixture models. Research Report RR-7786, INRIA

  46. Rigouste L, Cappé O, Yvon F (2007) Inference and evaluation of the multinomial mixture model for text clustering. Inf Process Manag 43(5):1260–1280

  47. Schwarz G (1978) Estimating the dimension of a model. Ann Stat 6(2):461–464

  48. Silvestre C, Cardoso MG, Figueiredo MA (2014) Identifying the number of clusters in discrete mixture models. arXiv preprint arXiv:1409.7419

  49. Sorlie T, Tibshirani R, Parker J, Hastie T, Marron J, Nobel A, Deng S, Johnsen H, Pesich R, Geisler S et al (2003) Repeated observation of breast tumor subtypes in independent gene expression data sets. Proc Natl Acad Sci USA 100(14):8418–8423

  50. St-Pierre AP, Shikon V, Schneider DC (2018) Count data in biology: data transformation or model reformation? Ecol Evol 8(6):3077–3085

  51. Tipping ME, Bishop CM (1999a) Mixtures of probabilistic principal component analyzers. Neural Comput 11(2):443–482

  52. Tipping ME, Bishop CM (1999b) Probabilistic principal component analysis. J R Stat Soc Ser B (Stat Methodol) 61(3):611–622

  53. Wallach HM (2008) Structured topic models for language. PhD thesis, University of Cambridge

  54. Watanabe K, Akaho S, Omachi S, Okada M (2010) Simultaneous clustering and dimensionality reduction using variational Bayesian mixture model. In: Classification as a tool for research. Springer, New York, pp 81–89

  55. Xie P, Xing EP (2013) Integrating document clustering and topic modeling. In: Proceedings of the 30th conference on uncertainty in artificial intelligence

  56. Xu W, Liu X, Gong Y (2003) Document clustering based on non-negative matrix factorization. In: Proceedings of the 26th annual international ACM SIGIR conference on research and development in information retrieval. ACM, pp 267–273

  57. Yu S, Yu K, Tresp V, Kriegel H-P (2005) A probabilistic clustering-projection model for discrete data. In: European conference on principles of data mining and knowledge discovery. Springer, New York, pp 417–428

  58. Zwiener I, Frisch B, Binder H (2014) Transforming RNA-seq data to improve the performance of prognostic gene signatures. PLoS ONE 9(1):e85150



This work was supported by a DIM Math Innov grant from Région Ile-de-France. This work has also been supported by the French government through the 3IA Côte d'Azur Investments in the Future project managed by the National Research Agency (ANR) with the reference number ANR-19-P3IA-0002. We are thankful for the support from fédération F2PM, CNRS FR 2036, Paris. Finally, we would like to thank the anonymous reviewers for their helpful comments, which helped to improve the paper.

Author information



Corresponding author

Correspondence to Nicolas Jouvin.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.



Constructing meta-observations

Proof of Proposition 1

$$\begin{aligned} {{\,\mathrm{p}\,}}(X, \theta \mid Y, \, \beta )&= {{\,\mathrm{p}\,}}(\theta ) \times {{\,\mathrm{p}\,}}(X\mid \theta , Y) ,\\&= \prod _{q^\prime } {{\,\mathrm{p}\,}}(\theta _{q^\prime }) \times \prod _i \prod _q \prod _n {\mathcal {M}}_V(w_{in}, \, 1 , \,\beta \theta _q)^{Y_{iq}} , \\&= \prod _q {{\,\mathrm{p}\,}}(\theta _q) \prod _i \prod _v \prod _n (\beta _{v,\cdot } \theta _q)^{ Y_{iq} w_{inv}} ,\\&= \prod _q {{\,\mathrm{p}\,}}(\theta _q) \prod _v \prod _i (\beta _{v,\cdot } \theta _q)^{ Y_{iq} x_{iv}} ,\\&= \prod _q {{\,\mathrm{p}\,}}(\theta _q) \prod _v (\beta _{v,\cdot } \theta _q)^{\sum _i Y_{iq} x_{iv}} , \end{aligned}$$

since \(x_{iv} = \sum _n w_{inv}\). Then, put

$$\begin{aligned} \tilde{X}_q(Y) = \sum _{i=1}^N Y_{iq} x_{i}\, \end{aligned}$$

and this completes the proof of Proposition 1. \(\square \)
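Proposition 1 shows that, conditionally on a hard partition \(Y\), the counts of each cluster can be pooled into a single meta-observation \(\tilde{X}_q(Y) = \sum_i Y_{iq} x_i\). As a minimal illustration, here is a stdlib-only Python sketch of this aggregation; the function and variable names are ours, not from the paper:

```python
def meta_observations(Y, X):
    """Pool row counts X (N x V) into Q meta-observations.

    Y is an N x Q binary matrix of hard cluster assignments; returns
    the Q x V matrix whose q-th row is X_tilde_q = sum_i Y_iq * x_i.
    """
    N, Q, V = len(Y), len(Y[0]), len(X[0])
    X_tilde = [[0] * V for _ in range(Q)]
    for i in range(N):
        for q in range(Q):
            if Y[i][q]:  # observation i belongs to cluster q
                for v in range(V):
                    X_tilde[q][v] += X[i][v]
    return X_tilde
```

With \(Y^\top X\) expressed as a matrix product, the same aggregation is a one-liner in any linear-algebra library.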

Derivation of the lower bound

Lower bound and Proposition 2

The bound of Eq. (14) follows from the standard derivation of the evidence lower bound in variational inference. Since the \(\log \) is concave, Jensen's inequality gives:

$$\begin{aligned} \log {{\,\mathrm{p}\,}}(X, Y \mid \pi , \beta )&= \log \sum _Z \int _{\theta } {{\,\mathrm{p}\,}}(X, Y, \theta , Z \mid \pi , \beta ) \mathrm{d}\theta ,\\&= \log \sum _Z \int _{\theta } \frac{{{\,\mathrm{p}\,}}(X, Y, \theta , Z \mid \pi , \beta )}{{{\,\mathrm{\mathcal {R}}\,}}(Z, \theta ) } {{\,\mathrm{\mathcal {R}}\,}}(Z, \theta ) \mathrm{d}\theta ,\\&= \log \left( {\mathbb {E}}_{{{\,\mathrm{\mathcal {R}}\,}}}\left[ \frac{{{\,\mathrm{p}\,}}(X, Y, Z, \theta \mid \pi , \beta )}{{{\,\mathrm{\mathcal {R}}\,}}(Z,\theta )}\right] \right) \\&\ge {\mathbb {E}}_{{{\,\mathrm{\mathcal {R}}\,}}}\left[ \log \frac{{{\,\mathrm{p}\,}}(X, Y, Z, \theta \mid \pi , \beta )}{{{\,\mathrm{\mathcal {R}}\,}}(Z,\theta )}\right] ,\\&:= {\mathcal {L}}({{\,\mathrm{\mathcal {R}}\,}}(\cdot ); \, \pi , \beta , Y) . \end{aligned}$$

Moreover, the difference between the classification log-likelihood and its bound is exactly the KL divergence between approximate posterior \({{\,\mathrm{\mathcal {R}}\,}}(\cdot )\) and the true one:

$$\begin{aligned} \log {{\,\mathrm{p}\,}}(X, Y \mid \pi , \beta ) - {\mathcal {L}}({{\,\mathrm{\mathcal {R}}\,}}(\cdot ); \, \pi , \beta , Y)&= - {\mathbb {E}}_{{{\,\mathrm{\mathcal {R}}\,}}}\left[ \log \frac{{{\,\mathrm{p}\,}}(Z, \theta \mid X, Y, \pi , \beta )}{{{\,\mathrm{\mathcal {R}}\,}}(Z,\theta )}\right] . \end{aligned}$$
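The Jensen step and this KL identity can be checked numerically on a toy discrete model. The following stdlib-only Python sketch (all numbers are illustrative assumptions, not taken from the paper) verifies that the log-evidence minus the bound equals the KL divergence exactly:

```python
import math

# Toy joint p(x, z) over a discrete latent z, with x held fixed:
p_joint = [0.1, 0.3, 0.2]            # p(x, z) for z = 0, 1, 2
p_x = sum(p_joint)                   # evidence p(x) = 0.6
posterior = [p / p_x for p in p_joint]

R = [0.2, 0.5, 0.3]                  # an arbitrary variational distribution

# ELBO = E_R[log p(x, z) / R(z)], KL = KL(R || posterior)
elbo = sum(r * math.log(pj / r) for r, pj in zip(R, p_joint))
kl = sum(r * math.log(r / post) for r, post in zip(R, posterior))

# The identity log p(x) - ELBO = KL(R || posterior) holds exactly,
# and Jensen guarantees the bound never exceeds the evidence:
assert abs(math.log(p_x) - elbo - kl) < 1e-12
assert elbo <= math.log(p_x)
```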

Furthermore, the complete expression is given in Proposition 2 as:


$$\begin{aligned}&{\mathcal {J}}_{\text {LDA}}^{(q)}( {{\,\mathrm{\mathcal {R}}\,}};\, \beta , \tilde{X}_q(Y))\nonumber \\&\quad = \log \varGamma (\textstyle \sum _{k=1}^{K} \alpha _k) - \sum _{k=1}^{K}\log \varGamma (\alpha _k) + \sum _{k=1}^{K} (\alpha _k - 1) (\psi (\gamma _{qk}) - \psi (\textstyle \sum _{l=1}^K \gamma _{ql})) \nonumber \\&\qquad + \sum _{i=1}^N Y_{iq} \sum _{k=1}^K \sum _{n=1}^{L_i} \phi _{ink} \left[ \psi (\gamma _{qk}) - \psi (\textstyle \sum _{l=1}^K \gamma _{ql}) + \sum _{v=1}^{V} w_{inv} \log (\beta _{vk})\right] \nonumber \\&\qquad - \log \varGamma (\textstyle \sum _{k=1}^{K} \gamma _{qk}) + \sum _{k=1}^{K}\log \varGamma (\gamma _{qk}) - \sum _{k=1}^{K} (\gamma _{qk} - 1) (\psi (\gamma _{qk}) - \psi (\textstyle \sum _{l=1}^K \gamma _{ql})) \nonumber \\&\qquad - \sum _{i=1}^{N} Y_{iq} \sum _{n=1}^{L_i} \sum _{k=1}^{K} \phi _{ink} \log (\phi _{ink}) . \end{aligned}$$

\(\square \)

Optimization of \({{\,\mathrm{\mathcal {R}}\,}}(Z)\)

Proof of Proposition 3

A classical result in mean-field inference (see Blei et al. 2017) states that, at the optimum and with all other distributions held fixed:

$$\begin{aligned} \log {{\,\mathrm{\mathcal {R}}\,}}(z_ {in})&= {\mathbb {E}}_{Z^{ \setminus i, n}, \theta } \left[ \log {{\,\mathrm{p}\,}}(X, Z, \theta \mid Y)\right] + {{\,\mathrm{const}\,}}, \end{aligned}$$

where the expectation is taken with respect to all \(Z\) except \(z_{in}\), and to all \(\theta \), assuming \((Z, \theta ) \sim {{\,\mathrm{\mathcal {R}}\,}}\). Developing the latter leads to:

$$\begin{aligned} \log {{\,\mathrm{\mathcal {R}}\,}}(z_ {in})&= \sum _{k=1}^{K} z_{ink} \left[ \sum _{v=1}^{V} w_{inv} \log (\beta _{vk}) + \psi (\gamma _{qk}) - \psi (\textstyle \sum _{l=1}^K \gamma _{ql}) \right] + {{\,\mathrm{const}\,}}. \end{aligned}$$

Equation (18) characterizes the log density of a multinomial:

$$\begin{aligned} {{\,\mathrm{\mathcal {R}}\,}}(z_{in}) = {\mathcal {M}}_K(z_{in}; \, 1, \,\phi _{in} = (\phi _{in1}, \ldots , \phi _{inK})), \end{aligned}$$

where the quantity inside the brackets is the logarithm of the parameter, up to the normalizing constant. Hence,

$$\begin{aligned} \forall k, \quad \phi _{ink} \propto \left( \prod _{v=1}^V \beta _{vk}^{w_{inv}} \right) \, \prod _{q=1}^Q \exp \left\{ \psi (\gamma _{qk}) - \psi \left( \textstyle \sum _{l=1}^K \gamma _{ql}\right) \right\} ^{Y_{iq}} . \end{aligned}$$

\(\square \)
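Since each token \(w_{in}\) is one-hot, the product over \(v\) in Proposition 3 selects the row of \(\beta \) for the observed word, and the product over \(q\) selects the cluster of document \(i\). A stdlib-only Python sketch of this update for a single token follows; the names are ours, and the `digamma` helper is a crude asymptotic approximation (in practice one would use `scipy.special.digamma`):

```python
import math

def digamma(x):
    """Approximate psi(x) via recurrence and the asymptotic expansion
    psi(x) ~ log(x) - 1/(2x) - 1/(12x^2) + 1/(120x^4) - 1/(252x^6)."""
    r = 0.0
    while x < 6.0:          # shift upward: psi(x) = psi(x + 1) - 1/x
        r -= 1.0 / x
        x += 1.0
    f = 1.0 / (x * x)
    return r + math.log(x) - 0.5 / x - f * (1 / 12 - f * (1 / 120 - f / 252))

def update_phi(v_in, q_i, beta, gamma):
    """Variational update of Proposition 3 for one token: word index
    v_in in a document assigned to cluster q_i. beta is V x K, gamma
    is Q x K. Returns the K-vector phi_in, normalized to sum to 1."""
    K = len(gamma[0])
    s = sum(gamma[q_i])
    log_phi = [math.log(beta[v_in][k]) + digamma(gamma[q_i][k]) - digamma(s)
               for k in range(K)]
    m = max(log_phi)                       # log-sum-exp for stability
    w = [math.exp(lp - m) for lp in log_phi]
    tot = sum(w)
    return [x / tot for x in w]
```

Normalizing in log space, as above, avoids underflow when \(V\) is large and the \(\beta _{vk}\) are tiny.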

Optimization of \({{\,\mathrm{\mathcal {R}}\,}}(\theta )\)

Proof of Proposition 4

With the same reasoning, the optimal form of \({{\,\mathrm{\mathcal {R}}\,}}(\theta )\) is:

$$\begin{aligned} \log {{\,\mathrm{\mathcal {R}}\,}}(\theta )&= {\mathbb {E}}_{Z}\left[ \log {{\,\mathrm{p}\,}}(X, Z, \theta \mid Y) \right] \, + \, {{\,\mathrm{const}\,}}\nonumber , \\&= \sum _{q=1}^{Q} \left[ \sum _{k=1}^{K} (\alpha _k - 1) \log (\theta _{qk}) + \sum _{i=1}^{N} Y_{iq} \sum _{n=1}^{L_i} \sum _{k=1}^{K} \phi _{ink} \log (\theta _{qk}) \right] + \, {{\,\mathrm{const}\,}}, \nonumber \\&= \sum _{q=1}^{Q}\sum _{k=1}^{K} \left[ \alpha _k + \sum _{i=1}^{N} Y_{iq} \sum _{n=1}^{L_i} \phi _{ink} - 1 \right] \log (\theta _{qk}) \, + \, {{\,\mathrm{const}\,}}. \end{aligned}$$

Once again, a specific functional form appears as the log of a product of Q independent Dirichlet densities. Then,

$$\begin{aligned} {{\,\mathrm{\mathcal {R}}\,}}(\theta ) = \prod _{q=1}^{Q} {{\,\mathrm{\mathcal {D}}\,}}_K\left( \theta _q; \, \gamma _q=(\gamma _{q1}, \ldots , \gamma _{qK})\right) , \end{aligned}$$

with the Dirichlet parameters inside the brackets of Eq. (19):

$$\begin{aligned} \forall (q,k), \quad \gamma _{qk} = \alpha _k + \sum _{i=1}^{N} Y_{iq}\sum _{n=1}^{L_i} \phi _{ink} . \end{aligned}$$

\(\square \)
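Proposition 4's update simply adds, to the prior pseudo-counts \(\alpha _k\), the expected topic counts of all documents assigned to cluster \(q\). A stdlib-only Python sketch, with names of our choosing:

```python
def update_gamma(alpha, Y, phi):
    """Proposition 4: gamma_qk = alpha_k + sum_i Y_iq sum_n phi_ink.

    alpha: length-K prior vector; Y: N x Q binary assignment matrix;
    phi: ragged list where phi[i][n][k] is the token-level posterior.
    Returns the Q x K matrix of Dirichlet parameters."""
    N, Q, K = len(Y), len(Y[0]), len(alpha)
    gamma = [[alpha[k] for k in range(K)] for _ in range(Q)]
    for i in range(N):
        for q in range(Q):
            if Y[i][q]:              # accumulate doc i into its cluster
                for tok in phi[i]:
                    for k in range(K):
                        gamma[q][k] += tok[k]
    return gamma
```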

Optimization of \(\beta \)

Proof of Proposition 5 (I)

This is a constrained maximization problem with K constraints \(\sum _{v=1}^{V} \beta _{vk} = 1\). Isolating the terms of Eq. (17) that depend on \(\beta \), and denoting the Lagrange multipliers by \((\lambda _k)_k\), the Lagrangian can be written:

$$\begin{aligned} f(\beta , \lambda ) =&\sum _{q=1}^{Q} \sum _{i=1}^{N} Y_{iq} \sum _{n=1}^{L_i} \sum _{k=1}^{K} \sum _{v=1}^{V} \phi _{ink} w_{inv} \log (\beta _{vk}) + \sum _{k=1}^{K} \lambda _k \left( \textstyle \sum _{v=1}^{V} \beta _{vk} - 1\right) , \\ =&\sum _{i=1}^{N} \sum _{n=1}^{L_i} \sum _{k=1}^{K} \sum _{v=1}^{V} \phi _{ink} w_{inv} \log (\beta _{vk}) + \sum _{k=1}^{K} \lambda _k \left( \textstyle \sum _{v=1}^{V} \beta _{vk} - 1\right) , \end{aligned}$$

since \(\textstyle \sum _{q=1}^{Q} Y_{iq} = 1\) for all \(i\). Setting the derivative with respect to \(\beta _{vk}\) to 0 and solving for \(\lambda _k\) yields:

$$\begin{aligned} \beta _{vk} \propto \sum _{i=1}^{N} \sum _{n=1}^{L_i} \phi _{ink} \, w_{inv} . \end{aligned}$$

\(\square \)
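Because the tokens \(w_{in}\) are one-hot, \(\sum _i \sum _n \phi _{ink} w_{inv}\) just accumulates \(\phi _{ink}\) over every occurrence of word \(v\), followed by a column-wise normalization. A stdlib-only Python sketch under that encoding assumption (names are ours):

```python
def update_beta(docs, phi, V):
    """Proposition 5 (I): beta_vk proportional to sum_i sum_n phi_ink w_inv.

    docs[i] is the list of word indices of document i (one-hot tokens);
    phi[i][n][k] is the matching token-level posterior. Returns a V x K
    matrix whose columns sum to 1."""
    K = len(phi[0][0])
    beta = [[0.0] * K for _ in range(V)]
    for doc, doc_phi in zip(docs, phi):
        for v, tok in zip(doc, doc_phi):
            for k in range(K):
                beta[v][k] += tok[k]
    # Enforce the constraint sum_v beta_vk = 1 for each topic k
    for k in range(K):
        col = sum(beta[v][k] for v in range(V))
        for v in range(V):
            beta[v][k] /= col
    return beta
```

In practice a small smoothing constant is often added before normalizing, so that no \(\beta _{vk}\) is exactly zero.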

Optimization of \(\pi \)

Proof of Proposition 5 (II)

The bound depends on \(\pi \) only through its clustering term:

$$\begin{aligned} \log {{\,\mathrm{p}\,}}(Y \mid \pi ) = \sum _{i=1}^{N}\sum _{q=1}^{Q} Y_{iq} \log (\pi _q) . \end{aligned}$$

Once again, this is a constrained optimization problem; introducing the Lagrange multiplier \(\lambda \) associated with the constraint \(\textstyle \sum _{q=1}^{Q} \pi _q = 1\), we get:

$$\begin{aligned} \sum _{q=1}^{Q} \sum _{i=1}^{N} Y_{iq} \log (\pi _q) + \lambda (\textstyle \sum _{q=1}^{Q} \pi _q - 1) . \end{aligned}$$

Setting the derivative with respect to \(\pi _q\) to 0, we get:

$$\begin{aligned} \pi _q = \frac{\sum _{i=1}^{N} Y_{iq}}{N} . \end{aligned}$$

\(\square \)

Model selection

Proof of Proposition 6

Assume that the parameters \((\pi , \beta )\) follow a prior distribution that factorizes as

$$\begin{aligned} {{\,\mathrm{p}\,}}(\pi , \beta \mid Q, K) = {{\,\mathrm{p}\,}}(\pi \mid Q, \eta ) \, {{\,\mathrm{p}\,}}(\beta \mid K), \end{aligned}$$

with

$$\begin{aligned} {{\,\mathrm{p}\,}}(\pi \mid Q, \eta ) ={\mathcal {D}}_Q(\pi ; \, \eta {\mathbf {1}}_Q) . \end{aligned}$$

Then, the classification log-likelihood is written:

$$\begin{aligned} \log {{\,\mathrm{p}\,}}(X, Y\mid Q, K)= & {} \log \int _{\pi } \int _{\beta }{{\,\mathrm{p}\,}}(X,Y, \beta , \pi \mid Q, K) \, \mathrm{d}\pi \, \mathrm{d}\beta \nonumber \\= & {} \log \int _{\pi } \int _{\beta }{{\,\mathrm{p}\,}}(X,Y \mid \beta , \pi , \, Q, K) {{\,\mathrm{p}\,}}(\pi \mid Q, \eta ) \, {{\,\mathrm{p}\,}}(\beta \mid K) \, \mathrm{d}\pi \, \mathrm{d}\beta \nonumber \\= & {} \log \int _{\pi } {{\,\mathrm{p}\,}}(Y \mid \pi ) {{\,\mathrm{p}\,}}(\pi \mid Q, \eta ) \mathrm{d}\pi \, \int _{\beta }{{\,\mathrm{p}\,}}(X\mid Y, \beta , Q, K) {{\,\mathrm{p}\,}}(\beta \mid K) \mathrm{d}\beta \nonumber \\= & {} \log \int _{\pi } {{\,\mathrm{p}\,}}(Y \mid \pi ) {{\,\mathrm{p}\,}}(\pi \mid Q, \eta ) \mathrm{d}\pi \nonumber \\&+ \log \int _{\beta }{{\,\mathrm{p}\,}}(X\mid Y, \beta , Q, K) {{\,\mathrm{p}\,}}(\beta \mid K) \mathrm{d}\beta . \end{aligned}$$

The first term in Eq. (22) is exact by Dirichlet-multinomial conjugacy. Setting \(\eta =\frac{1}{2}\) and applying a Stirling approximation to the Gamma function, as in Daudin et al. (2008), leads to:

$$\begin{aligned} \log \int _{\pi } {{\,\mathrm{p}\,}}(Y \mid \pi ) {{\,\mathrm{p}\,}}(\pi \mid Q, \eta ) \mathrm{d}\pi \approx \max \limits _{\pi } \log {{\,\mathrm{p}\,}}(Y \mid \pi , Q) - \frac{Q-1}{2} \log (D) . \end{aligned}$$

As for the second term, a BIC-like approximation as in Bouveyron et al. (2018) gives:

$$\begin{aligned} \log \int _{\beta }{{\,\mathrm{p}\,}}(X\mid Y, \beta , Q, K) {{\,\mathrm{p}\,}}(\beta \mid K) \mathrm{d}\beta \approx \max \limits _{\beta } \log {{\,\mathrm{p}\,}}(X\mid Y, \beta , Q, K) - \frac{K (V-1)}{2} \log (Q). \end{aligned}$$

In practice, \( \log {{\,\mathrm{p}\,}}(X\mid Y, \beta , Q, K) \) is still intractable, so we replace it with its variational approximation after convergence of the VEM algorithm, \({\mathcal {J}}^\star _{\text {LDA}}\), which is the sum of the individual LDA bounds of the meta-observations detailed in Eq. (17) (and differs from \({\mathcal {L}}\)). In the end, this gives the following criterion:

$$\begin{aligned} {{\,\mathrm{ICL}\,}}(Q, K, Y, X)= & {} {\mathcal {J}}^\star _{\text {LDA}}({{\,\mathrm{\mathcal {R}}\,}}; \, \beta , Y) - \frac{K (V-1)}{2} \log (Q) \nonumber \\&+ \max \limits _{\pi } \log {{\,\mathrm{p}\,}}(Y \mid \pi , Q) - \frac{Q-1}{2} \log (D) . \end{aligned}$$

Note that:

$$\begin{aligned} \max \limits _{\beta } \log {{\,\mathrm{p}\,}}(X\mid Y, \beta , Q, K) + \max \limits _{\pi } \log {{\,\mathrm{p}\,}}(Y \mid \pi , Q) \approx {\mathcal {L}}^\star , \end{aligned}$$

i.e., the bound obtained after Algorithm 1 has converged. \(\square \)
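Once the bound \({\mathcal {J}}^\star _{\text {LDA}}\) and \(\max _{\pi } \log {{\,\mathrm{p}\,}}(Y \mid \pi , Q)\) are available, the ICL criterion reduces to two BIC-like penalties. A minimal stdlib-only Python sketch (the function name and the idea of passing the two fitted terms as plain numbers are our assumptions for illustration):

```python
import math

def icl(J_lda_star, logp_Y_given_pi, Q, K, V, D):
    """ICL-like criterion of Proposition 6: the converged variational
    bound plus the maximized clustering term, penalized by
    K(V-1)/2 log(Q) for beta and (Q-1)/2 log(D) for pi."""
    return (J_lda_star - K * (V - 1) / 2 * math.log(Q)
            + logp_Y_given_pi - (Q - 1) / 2 * math.log(D))
```

Model selection then amounts to fitting the model on a grid of \((Q, K)\) values and retaining the pair with the largest criterion.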


About this article


Cite this article

Jouvin, N., Latouche, P., Bouveyron, C. et al. Greedy clustering of count data through a mixture of multinomial PCA. Comput Stat 36, 1–33 (2021).



  • Clustering
  • Mixture models
  • Count data
  • Dimension reduction
  • Topic modeling
  • Variational inference