Abstract
Count data are increasingly ubiquitous across a wide range of applications, with datasets growing in both size and dimension. In this context, a growing body of work is dedicated to statistical models that directly account for the discrete nature of the data. Moreover, integrating dimension reduction into clustering has been shown to drastically improve performance and stability. In this paper, we rely on the mixture of multinomial PCA, a mixture model for the clustering of count data also known in the literature as the probabilistic clustering-projection model. Related to the latent Dirichlet allocation model, it offers the flexibility of topic modeling while assigning each observation to a unique cluster. We introduce a greedy clustering algorithm in which inference and clustering are performed jointly, combining a classification variational expectation maximization algorithm with a branch-and-bound-like strategy on a variational lower bound. An integrated classification likelihood criterion is derived for model selection, and a thorough study with numerical experiments is proposed to assess both the performance and the robustness of the method. Finally, we illustrate the qualitative interest of the method in a real-world application, the clustering of anatomopathological medical reports, in partnership with expert practitioners from the Institut Curie hospital.
Notes
Available on CRAN.
In-situ cancers are pre-invasive lesions, so named because they have not yet started to spread. Invasive cancer tissue can contain both invasive and in-situ lesions on the same slide.
References
Aggarwal CC, Zhai C (2012) A survey of text clustering algorithms. Mining text data. Springer, New York, pp 77–128
Akaike H (1998) Information theory and an extension of the maximum likelihood principle. Selected papers of hirotugu akaike. Springer, New York, pp 199–213
Anders S, Huber W (2010) Differential expression analysis for sequence count data. Genome Biol 11(10):R106
Banfield JD, Raftery AE (1993) Model-based Gaussian and non-Gaussian clustering. Biometrics 803–821
Bergé LR, Bouveyron C, Corneli M, Latouche P (2019) The latent topic block model for the co-clustering of textual interaction data. Comput Stat Data Anal
Biernacki C, Celeux G, Govaert G (2000) Assessing a mixture model for clustering with the integrated completed likelihood. IEEE Trans Pattern Anal Mach Intell 22(7):719–725
Blei DM, Kucukelbir A, McAuliffe JD (2017) Variational inference: a review for statisticians. J Am Stat Assoc 112(518):859–877
Blei DM, Ng AY, Jordan MI (2003) Latent Dirichlet allocation. J Mach Learn Res 3(Jan):993–1022
Bouveyron C, Celeux G, Murphy TB, Raftery AE (2019) Model-based clustering and classification for data science: with applications in R. Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press, Cambridge
Bouveyron C, Girard S, Schmid C (2007) High-dimensional data clustering. Comput Stat Data Anal 52(1):502–519
Bouveyron C, Latouche P, Zreik R (2018) The stochastic topic block model for the clustering of vertices in networks with textual edges. Stat Comput 28(1):11–31
Bui QV, Sayadi K, Amor SB, Bui M (2017) Combining latent Dirichlet allocation and k-means for documents clustering: effect of probabilistic based distance measures. In: Asian conference on intelligent information and database systems. Springer, New York, pp 248–257
Buntine W (2002) Variational extensions to EM and multinomial PCA. In: European conference on machine learning. Springer, New York, pp 23–34
Buntine WL, Perttu S (2003) Is multinomial PCA multi-faceted clustering or dimensionality reduction? In: AISTATS
Carel L, Alquier P (2017) Simultaneous dimension reduction and clustering via the NMF-EM algorithm. arXiv preprint arXiv:1709.03346
Celeux G, Govaert G (1992) A classification EM algorithm for clustering and two stochastic versions. Comput Stat Data Anal 14(3):315–332
Chien J-T, Lee C-H, Tan Z-H (2017) Latent Dirichlet mixture model. Neurocomputing
Chiquet J, Mariadassou M, Robin S et al (2018) Variational inference for probabilistic Poisson PCA. Ann Appl Stat 12(4):2674–2698
Cunningham RB, Lindenmayer DB (2005) Modeling count data of rare species: some statistical issues. Ecology 86(5):1135–1142
Daudin J-J, Picard F, Robin S (2008) A mixture model for random graphs. Stat Comput 18(2):173–183
Defossez G, Le Guyader-Peyrou S, Uhry Z, Grosclaude P, Remontet L, Colonna M, Dantony E, Delafosse P, Molinié F, Woronoff A-S, et al (2019) Estimations nationales de l’incidence et de la mortalité par cancer en france métropolitaine entre 1990 et 2018. Résultats préliminaires. Saint-Maurice (Fra): Santé publique France
Dempster AP, Laird NM, Rubin DB (1977) Maximum likelihood from incomplete data via the EM algorithm. J R Stat Soc: Ser B (Methodol) 39(1):1–22
Ding C, Li T, Peng W (2008) On the equivalence between non-negative matrix factorization and probabilistic latent semantic indexing. Comput Stat Data Anal 52(8):3913–3927
Eckart C, Young G (1936) The approximation of one matrix by another of lower rank. Psychometrika 1(3):211–218
Ellis IO, Elston CW (2006) Histologic grade. Breast pathology. Elsevier, Amsterdam, pp 225–233
Fordyce JA, Gompert Z, Forister ML, Nice CC (2011) A hierarchical Bayesian approach to ecological count data: a flexible tool for ecologists. PLoS ONE 6(11):e26785
Hartigan JA (1975) Clustering algorithms. Wiley, Hoboken
Hoffman M, Bach FR, Blei DM (2010) Online learning for latent Dirichlet allocation. Adv Neural Inf Process Syst 856–864
Hofmann T (1999) Probabilistic latent semantic analysis. In: Proceedings of the fifteenth conference on Uncertainty in artificial intelligence. Morgan Kaufmann Publishers Inc, pp 289–296
Hornik K, Grün B (2011) topicmodels: an R package for fitting topic models. J Stat Softw 40(13):1–30
Hotelling H (1933) Analysis of a complex of statistical variables into principal components. J Educ Psychol 24(6):417
Lakhani SR (2012) WHO classification of tumours of the breast. International Agency for Research on Cancer
Lazebnik S, Schmid C, Ponce J (2006) Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories. In: 2006 IEEE computer society conference on computer vision and pattern recognition (CVPR’06), volume 2. IEEE, pp 2169–2178
Lee DD, Seung HS (1999) Learning the parts of objects by non-negative matrix factorization. Nature 401(6755):788
Lee DD, Seung HS (2001) Algorithms for non-negative matrix factorization. Adv Neural Inf Process Syst 556–562
Liu L, Tang L, Dong W, Yao S, Zhou W (2016) An overview of topic modeling and its current applications in bioinformatics. SpringerPlus 5(1):1608
Mattei P-A, Bouveyron C, Latouche P (2016) Globally sparse probabilistic PCA. Artif Intell Stat 976–984
McLachlan G, Peel D (2000) Finite mixture models. Wiley Series in Probability and Statistics
Nelder JA, Wedderburn RW (1972) Generalized linear models. J R Stat Soc: Ser A (Gen) 135(3):370–384
Osborne J (2005) Notes on the use of data transformations. Pract Assess Res Eval 9(1):42–50
O'Hara RB, Kotze DJ (2010) Do not log-transform count data. Methods Ecol Evol 1(2):118–122
R Core Team (2019) R: a language and environment for statistical computing organization. R Foundation for Statistical Computing, Vienna, Austria. https://www.R-project.org/
Ramos J et al (2003) Using TF-IDF to determine word relevance in document queries. In: Proceedings of the first instructional conference on machine learning, volume 242, Piscataway, pp 133–142
Rand WM (1971) Objective criteria for the evaluation of clustering methods. J Am Stat Assoc 66(336):846–850
Rau A, Celeux G, Martin-Magniette M-L, Maugis-Rabusseau C (2011) Clustering high-throughput sequencing data with Poisson mixture models. Research Report RR-7786, INRIA
Rigouste L, Cappé O, Yvon F (2007) Inference and evaluation of the multinomial mixture model for text clustering. Inf Process Manag 43(5):1260–1280
Schwarz G et al (1978) Estimating the dimension of a model. Ann Stat 6(2):461–464
Silvestre C, Cardoso MG, Figueiredo MA (2014) Identifying the number of clusters in discrete mixture models. arXiv preprint arXiv:1409.7419
Sorlie T, Tibshirani R, Parker J, Hastie T, Marron J, Nobel A, Deng S, Johnsen H, Pesich R, Geisler S et al (2003) Repeated observation of breast tumor subtypes in independent gene expression data sets. Proc Nat Acad Sci USA 100(14):8418–8423
St-Pierre AP, Shikon V, Schneider DC (2018) Count data in biology-data transformation or model reformation? Ecol Evol 8(6):3077–3085
Tipping ME, Bishop CM (1999a) Mixtures of probabilistic principal component analyzers. Neural Comput 11(2):443–482
Tipping ME, Bishop CM (1999b) Probabilistic principal component analysis. J R Stat Soc: Ser B (Stat Methodol) 61(3):611–622
Wallach HM (2008) Structured topic models for language. PhD thesis, University of Cambridge
Watanabe K, Akaho S, Omachi S, Okada M (2010) Simultaneous clustering and dimensionality reduction using variational bayesian mixture model. Classification as a tool for research. Springer, New York, pp 81–89
Xie P, Xing EP (2013) Integrating document clustering and topic modeling. In: Proceedings of the 30th conference on uncertainty in artificial intelligence
Xu W, Liu X, Gong Y (2003) Document clustering based on non-negative matrix factorization. In: Proceedings of the 26th annual international ACM SIGIR conference on research and development in information retrieval. ACM, pp 267–273
Yu S, Yu K, Tresp V, Kriegel H-P (2005) A probabilistic clustering-projection model for discrete data. European conference on principles of data mining and knowledge discovery. Springer, New York, pp 417–428
Zwiener I, Frisch B, Binder H (2014) Transforming RNA-seq data to improve the performance of prognostic gene signatures. PLoS ONE 9(1):e85150
Acknowledgements
This work was supported by a DIM Math Innov grant from Région Ile-de-France. This work has also been supported by the French government through the 3IA Côte d’Azur Investments in the Future project managed by the National Research Agency (ANR) with the reference number ANR-19-P3IA-0002. We are thankful for the support from fédération F2PM, CNRS FR 2036, Paris. Finally, we would like to thank the anonymous reviewers for their helpful comments, which helped to improve the paper.
Proofs
1.1 Constructing the meta-observations
Proof of Proposition 1
since \(x_{iv} = \sum _n w_{inv}\). Then, put
and this completes the proof of Proposition 1. \(\square \)
1.2 Derivation of the lower bound
Lower bound and Proposition 2
The bound of Eq. (14) follows from the standard derivation of the evidence lower bound in variational inference. Since \(\log \) is concave, Jensen's inequality gives:
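In generic notation (a standard sketch of the evidence lower bound; the paper's Eq. (14) is the instance of this inequality for the present model), Jensen's inequality applied to the marginal over \((Z, \theta )\) reads:

```latex
\log p(X \mid \pi, \beta)
  = \log \mathbb{E}_{\mathcal{R}}\!\left[ \frac{p(X, Z, \theta \mid \pi, \beta)}{\mathcal{R}(Z, \theta)} \right]
  \;\ge\; \mathbb{E}_{\mathcal{R}}\!\left[ \log p(X, Z, \theta \mid \pi, \beta) \right]
        - \mathbb{E}_{\mathcal{R}}\!\left[ \log \mathcal{R}(Z, \theta) \right]
  =: \mathcal{L}(\mathcal{R};\, \pi, \beta).
```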
Moreover, the difference between the classification log-likelihood and its bound is exactly the KL divergence between the approximate posterior \({{\,\mathrm{\mathcal {R}}\,}}(\cdot )\) and the true one:
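Writing \(\mathcal{L}\) for the bound, this gap is the standard identity (generic notation, not the paper's exact display):

```latex
\log p(X \mid \pi, \beta) - \mathcal{L}(\mathcal{R};\, \pi, \beta)
  = \mathrm{KL}\!\left( \mathcal{R}(Z, \theta) \,\Vert\, p(Z, \theta \mid X, \pi, \beta) \right) \;\ge\; 0,
```

so maximizing the bound over \(\mathcal{R}\) amounts to minimizing this divergence.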
Furthermore, the complete expression is given in Proposition 2 as:
where
\(\square \)
1.3 Optimization of \({{\,\mathrm{\mathcal {R}}\,}}(Z)\)
Proof of Proposition 3
A classical result in mean-field inference (see Blei et al. 2017) states that, at the optimum, with all other distributions held fixed:
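In generic mean-field notation, this classical optimality condition reads (with \(\mathcal{R}^\star \) the optimal factor; a sketch, not the paper's exact display):

```latex
\log \mathcal{R}^{\star}(z_{in})
  = \mathbb{E}_{\mathcal{R}(Z^{\setminus in},\, \theta)}\!\left[ \log p(X, Z, \theta \mid \pi, \beta) \right] + \text{const}.
```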
where the expectation is taken with respect to all \(Z\) except \(z_{in}\), and to all \(\theta \), assuming \((Z, \theta ) \sim {{\,\mathrm{\mathcal {R}}\,}}\). Developing the latter leads to:
Equation (18) characterizes the log density of a multinomial:
where the quantity inside brackets represents the logarithm of the parameter, modulo the normalizing constant. Hence,
\(\square \)
1.4 Optimization of \({{\,\mathrm{\mathcal {R}}\,}}(\theta )\)
Proof of Proposition 4
With the same reasoning, the optimal form of \({{\,\mathrm{\mathcal {R}}\,}}(\theta )\) is:
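Generically, the mean-field optimality condition for this factor is (a sketch in the same generic notation as above):

```latex
\log \mathcal{R}^{\star}(\theta)
  = \mathbb{E}_{\mathcal{R}(Z)}\!\left[ \log p(X, Z, \theta \mid \pi, \beta) \right] + \text{const}.
```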
Once again, a specific functional form appears as the log of a product of Q independent Dirichlet densities. Then,
with the Dirichlet parameters inside the brackets of Eq. (19):
\(\square \)
1.5 Optimization of \(\beta \)
Proof of Proposition 5 (I)
This is a constrained maximization problem with K constraints \(\sum _{v=1}^{V} \beta _{vk} = 1\). Isolating the terms of Eq. (17) that depend on \(\beta \), and denoting the constraint multipliers by \((\lambda _k)_k\), the Lagrangian can be written:
Setting its derivative to 0 yields:
\(\square \)
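To illustrate the argument with a generic objective: the coefficients \(c_{vk}\) below are placeholders for whatever expected counts multiply \(\log \beta _{vk}\) in Eq. (17), not the paper's exact terms. The Lagrange condition then gives the usual normalized closed form:

```latex
\mathcal{L}(\beta, \lambda)
  = \sum_{k=1}^{K} \sum_{v=1}^{V} c_{vk} \log \beta_{vk}
  + \sum_{k=1}^{K} \lambda_k \Big( 1 - \sum_{v=1}^{V} \beta_{vk} \Big),
\qquad
\frac{\partial \mathcal{L}}{\partial \beta_{vk}}
  = \frac{c_{vk}}{\beta_{vk}} - \lambda_k = 0
\;\Longrightarrow\;
\beta_{vk} = \frac{c_{vk}}{\sum_{v'=1}^{V} c_{v'k}}.
```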
1.6 Optimization of \(\pi \)
Proof of Proposition 5 (II)
The bound depends on \(\pi \) only through its clustering term:
Once again, this is a constrained optimization problem; introducing the Lagrange multiplier \(\lambda \) associated with the constraint \(\textstyle \sum _{q=1}^{Q} \pi _q = 1\), we get:
Setting the derivative with respect to \(\pi _q\) to 0, we get:
\(\square \)
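The same Lagrange argument yields the familiar proportion update, \(\pi _q\) proportional to the total posterior responsibility of cluster \(q\). A minimal numerical sketch of this closed form, with hypothetical responsibilities (all names and values are illustrative, not the paper's data):

```python
import numpy as np

# Hypothetical soft cluster assignments r[i, q] for N = 4 observations
# and Q = 2 clusters (placeholder values for illustration only).
r = np.array([[0.9, 0.1],
              [0.8, 0.2],
              [0.3, 0.7],
              [0.2, 0.8]])
N = r.shape[0]

# Closed-form constrained optimum of sum_{i,q} r[i,q] * log(pi[q])
# under sum_q pi[q] = 1, from the Lagrange-multiplier argument:
# pi[q] = (1/N) * sum_i r[i, q].
pi = r.sum(axis=0) / N
print(pi)  # [0.55 0.45]

# Sanity check: a fine grid search over the 2-cluster simplex
# recovers (approximately) the same maximizer.
grid = np.linspace(1e-3, 1 - 1e-3, 999)
scores = [np.sum(r * np.log(np.array([p, 1.0 - p]))) for p in grid]
best = grid[int(np.argmax(scores))]
```

The grid search agrees with the closed form up to the grid resolution, confirming that the normalized responsibilities are the constrained maximizer.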
1.7 Model selection
Proof of Proposition 6
Assuming that the parameters \((\pi , \beta )\) follow a prior distribution that factorizes as follows:
where
Then, the classification log-likelihood is written:
The first term in Eq. (22) is exact by Dirichlet-multinomial conjugacy. Setting \(\eta =\frac{1}{2}\) and applying a Stirling approximation to the Gamma function, as in Daudin et al. (2008), leads to:
As for the second term, a BIC-like approximation as in Bouveyron et al. (2018) gives:
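Generically, such a BIC-like approximation has the form below. This is a hedged sketch: \(\nu \) denotes the number of free parameters in \(\beta \) (here \(K(V-1)\) for \(K\) columns on the \(V\)-simplex) and \(n\) the relevant sample size; the paper's exact penalty may differ.

```latex
\log p(X \mid Y, Q, K)
  \;\approx\; \max_{\beta} \log p(X \mid Y, \beta, Q, K)
  \;-\; \frac{\nu}{2} \log n,
\qquad \nu = K (V - 1).
```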
In practice, \( \log {{\,\mathrm{p}\,}}(X\mid Y, \beta , Q, K) \) remains intractable, so we replace it by its variational approximation after convergence of the VEM, \({\mathcal {J}}^\star _{\text {LDA}}\), the sum of the per-meta-observation LDA bounds detailed in Eq. (17) (distinct from \({\mathcal {L}}\)). In the end, this gives the following criterion:
Note that:
i.e., the bound obtained once Algorithm 1 has converged. \(\square \)
Cite this article
Jouvin, N., Latouche, P., Bouveyron, C. et al. Greedy clustering of count data through a mixture of multinomial PCA. Comput Stat 36, 1–33 (2021). https://doi.org/10.1007/s00180-020-01008-9