Abstract
Many of the datasets encountered in statistics are two-dimensional in nature and can be represented by a matrix. Classical clustering procedures seek to construct separately an optimal partition of rows or, sometimes, of columns. In contrast, co-clustering methods cluster the rows and the columns simultaneously and organize the data into homogeneous blocks (after suitable permutations). Methods of this kind have practical importance in a wide variety of applications such as document clustering, where data are typically organized in two-way contingency tables. Our goal is to offer coherent frameworks for understanding some existing criteria and algorithms for co-clustering contingency tables, and to propose new ones. We look at two different frameworks for the problem of co-clustering. The first involves minimizing an objective function based on measures of association and in particular on phi-squared and mutual information. The second uses a model-based co-clustering approach, and we consider two models: the block model and the latent block model. We establish connections between different approaches, criteria and algorithms, and we highlight a number of implicit assumptions in some commonly used algorithms. Our contribution is illustrated by numerical experiments on simulated and real-case datasets that show the relevance of the presented methods in the document clustering field.
Acknowledgements
We thank the referees and editors for their valuable suggestions. We acknowledge funding support from AAP Sorbonne Paris Cité.
Appendices
Appendix A: Proof of Proposition 1
Denoting \(e_{ij}=p_{ij}-p_{i.} p_{.j}\) \(\forall i,j\) and using the expansion \(\log (1+x)=x-x^2/2 + O(x^3)\), we can write \(\log (1+\frac{e_{ij}}{p_{i.}p_{.j}})=\frac{e_{ij}}{p_{i.}p_{.j}} - \frac{1}{2} \left( \frac{e_{ij}}{p_{i.}p_{.j}}\right) ^2 + O(e_{ij}^3)\) for \(e_{ij} \rightarrow 0\), and
this yields
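The expansion implies that, near independence, the mutual information is close to half the phi-squared coefficient. As a numerical sanity check (not part of the original proof; NumPy assumed, and the 2×2 table and its margin-preserving perturbation are illustrative choices):

```python
import numpy as np

# Independent base table plus a small perturbation e_ij with zero row/column sums
p_i = np.array([0.3, 0.7])
p_j = np.array([0.4, 0.6])
base = np.outer(p_i, p_j)
e = 1e-3 * np.array([[1.0, -1.0], [-1.0, 1.0]])  # margins unchanged
P = base + e

pi, pj = P.sum(1), P.sum(0)
ind = np.outer(pi, pj)                 # product of margins p_i. p_.j
I = np.sum(P * np.log(P / ind))        # mutual information
phi2 = np.sum((P - ind) ** 2 / ind)    # phi-squared coefficient

# For small e_ij, I(P) and Phi^2(P)/2 agree up to O(e^3)
assert abs(I - phi2 / 2) < 1e-6
```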
Appendix B: Properties of \(P_{KL}^{\mathbf {z}\mathbf {w}}\)
Proof
We have
\(\square \)
Appendix C: Properties of \(Q_{IJ}^{\mathbf {z}\mathbf {w}}\)
Proof
and, symmetrically, \(q_{.j}^{\mathbf {w}}=p_{.j}.\) \(\square \)
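The margin-preservation property can be checked numerically. In this sketch (NumPy assumed; the table and the partitions are arbitrary illustrations), \(q_{ij}^{\mathbf {z}\mathbf {w}}\) is built, as in Appendix E, as \(p_{i.}p_{.j}\,p_{k\ell }^{\mathbf {z}\mathbf {w}}/(p_{k.}^{\mathbf {z}}p_{.\ell }^{\mathbf {w}})\) for \(i\) in row cluster \(k\) and \(j\) in column cluster \(\ell \):

```python
import numpy as np

rng = np.random.default_rng(0)
P = rng.random((4, 5)); P /= P.sum()        # joint table p_ij
z = np.array([0, 0, 1, 1])                  # row partition (illustrative)
w = np.array([0, 1, 1, 2, 2])               # column partition (illustrative)
Z = np.eye(2)[z]; W = np.eye(3)[w]          # indicator matrices

pi, pj = P.sum(1), P.sum(0)
Pkl = Z.T @ P @ W                           # aggregated table p_kl^{zw}
ratio = Pkl / np.outer(Z.T @ pi, W.T @ pj)  # p_kl / (p_k. p_.l)
Q = np.outer(pi, pj) * ratio[z][:, w]       # q_ij^{zw}

# Q is a distribution with the same margins as P
assert np.isclose(Q.sum(), 1.0)
assert np.allclose(Q.sum(1), pi) and np.allclose(Q.sum(0), pj)
```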
Appendix D: Proof of Proposition 2
Lemma 1
Proof
\(\square \)
Proof of Proposition 2
-
The first equation (a) can easily be deduced from Lemma 1.
-
Second equation (b):
$$\begin{aligned} \Phi ^2(P_{IJ})-\Phi ^2(Q_{IJ}^{\mathbf {z}\mathbf {w}})= & {} \left( \sum _{i,j} \frac{p_{ij}^2}{p_{i.}p_{.j}} -1\right) - \left( \sum _{i,j} \frac{(q_{ij}^{\mathbf {z}\mathbf {w}})^2}{q_{i.}^{\mathbf {z}}q_{.j}^{\mathbf {w}}} -1\right) \\= & {} \sum _{i,j} \frac{p_{ij}^2}{p_{i.}p_{.j}} - \sum _{i,j} \frac{(q_{ij}^{\mathbf {z}\mathbf {w}})^2}{p_{i.}p_{.j}}. \end{aligned}$$Using Lemma 1, this relation can be written
$$\begin{aligned} \sum _{i,j} \frac{p_{ij}^2}{p_{i.}p_{.j}} - \sum _{i,j} \frac{p_{ij}q_{ij}^{\mathbf {z}\mathbf {w}}}{p_{i.}p_{.j}} = \sum _{i,j} p_{ij} \left( \frac{p_{ij}}{p_{i.}p_{.j}} - \frac{q_{ij}^{\mathbf {z}\mathbf {w}}}{p_{i.}p_{.j}}\right) \end{aligned}$$and
$$\begin{aligned}&\sum _{i,j} \frac{p_{ij}^2}{p_{i.}p_{.j}} - 2 \sum _{i,j} \frac{p_{ij}q_{ij}^{\mathbf {z}\mathbf {w}}}{p_{i.}p_{.j}} + \sum _{i,j} \frac{(q_{ij}^{\mathbf {z}\mathbf {w}})^2}{p_{i.}p_{.j}}=\sum _{i,j} \frac{(p_{ij}-q_{ij}^{\mathbf {z}\mathbf {w}})^2}{p_{i.}p_{.j}}\\&\quad =\chi ^2(P_{IJ}||Q_{IJ}^{\mathbf {z}\mathbf {w}})\ge 0. \end{aligned}$$ -
The two inequalities (c) can easily be deduced from the previous relation.
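Equation (b) can be verified numerically. The sketch below (NumPy assumed; the table and the partitions \(\mathbf {z},\mathbf {w}\) are arbitrary illustrations) checks that \(\Phi ^2(P_{IJ})-\Phi ^2(Q_{IJ}^{\mathbf {z}\mathbf {w}})=\chi ^2(P_{IJ}||Q_{IJ}^{\mathbf {z}\mathbf {w}})\):

```python
import numpy as np

rng = np.random.default_rng(1)
P = rng.random((4, 5)); P /= P.sum()
z = np.array([0, 0, 1, 1]); w = np.array([0, 1, 1, 2, 2])
Z = np.eye(2)[z]; W = np.eye(3)[w]

pi, pj = P.sum(1), P.sum(0)
Pkl = Z.T @ P @ W
ratio = Pkl / np.outer(Z.T @ pi, W.T @ pj)
Q = np.outer(pi, pj) * ratio[z][:, w]       # block approximation q_ij^{zw}

def phi2(T):
    """Phi-squared coefficient of a joint table T."""
    ind = np.outer(T.sum(1), T.sum(0))
    return np.sum((T - ind) ** 2 / ind)

chi2 = np.sum((P - Q) ** 2 / np.outer(pi, pj))
assert np.isclose(phi2(P) - phi2(Q), chi2)  # equation (b)
assert chi2 >= 0                            # hence Phi^2(Q) <= Phi^2(P)
```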
Appendix E: Proof of Proposition 3
-
First equation (d):
$$\begin{aligned} \mathcal {I}(Q_{IJ}^{\mathbf {z}\mathbf {w}})= & {} \sum _{i,j} q_{ij}^{\mathbf {z}\mathbf {w}} \log \frac{q_{ij}^{\mathbf {z}\mathbf {w}}}{p_{i.}p_{.j}} =\sum _{i,j}\left( p_{i.}p_{.j}\left( \sum _{k,\ell }z_{ik} w_{j\ell }\frac{p_{k\ell }^{\mathbf {z}\mathbf {w}}}{p_{k.}^{\mathbf {z}}p_{.\ell }^{\mathbf {w}}}\right) \right. \\&\left. \times \log \left( \sum _{k,\ell } z_{ik}w_{j\ell }\frac{p_{k\ell }^{\mathbf {z}\mathbf {w}}}{p_{k.}^{\mathbf {z}}p_{.\ell }^{\mathbf {w}}}\right) \right) \\= & {} \sum _{i,j}\left( p_{i.}p_{.j}\left( \sum _{k,\ell }z_{ik} w_{j\ell }\frac{p_{k\ell }^{\mathbf {z}\mathbf {w}}}{p_{k.}^{\mathbf {z}}p_{.\ell }^{\mathbf {w}}} \log \frac{p_{k\ell }^{\mathbf {z}\mathbf {w}}}{p_{k.}^{\mathbf {z}}p_{.\ell }^{\mathbf {w}}}\right) \right) \\= & {} \sum _{k,\ell } p_{k\ell }^{\mathbf {z}\mathbf {w}} \log \frac{p_{k\ell }^{\mathbf {z}\mathbf {w}}}{p_{k.}^{\mathbf {z}}p_{.\ell }^{\mathbf {w}}}\\= & {} \mathcal {I}(P_{KL}^{\mathbf {z}\mathbf {w}}). \end{aligned}$$ -
Second equation (e):
$$\begin{aligned} \mathcal {I}(P_{IJ})-\mathcal {I}(Q_{IJ}^{\mathbf {z}\mathbf {w}})= & {} \mathcal {I}(P_{IJ})-\mathcal {I}(P_{KL}^{\mathbf {z}\mathbf {w}}) =\sum _{i,j} p_{ij} \log \frac{p_{ij}}{p_{i.}p_{.j}} -\sum _{k,\ell } p_{k\ell }^{\mathbf {z}\mathbf {w}} \log \frac{p_{k\ell }^{\mathbf {z}\mathbf {w}}}{p_{k.}^{\mathbf {z}}p_{.\ell }^{\mathbf {w}}}\\= & {} \sum _{i,j,k,\ell } z_{ik}w_{j\ell } p_{ij} \log \frac{p_{ij}}{p_{i.}p_{.j}} -\sum _{i,j,k,\ell }(z_{ik}w_{j\ell } p_{ij}) \log \frac{p_{k\ell }^{\mathbf {z}\mathbf {w}}}{p_{k.}^{\mathbf {z}}p_{.\ell }^{\mathbf {w}}}\\= & {} \sum _{i,j,k,\ell } z_{ik}w_{j\ell } p_{ij} \log \frac{p_{ij}}{p_{i.}p_{.j}} \frac{p_{k.}^{\mathbf {z}} p_{.\ell }^{\mathbf {w}}}{p_{k\ell }^{\mathbf {z}\mathbf {w}}}\\= & {} \sum _{i,j,k,\ell } z_{ik}w_{j\ell } p_{ij} \log \frac{p_{ij}}{p_{i.}p_{.j} \frac{p_{k\ell }^{\mathbf {z}\mathbf {w}}}{p_{k.}^{\mathbf {z}} p_{.\ell }^{\mathbf {w}}}}\\= & {} \sum _{i,j} p_{ij} \log \frac{p_{ij}}{p_{i.}p_{.j}\sum _{k,\ell } z_{ik} w_{j\ell } \frac{p_{k\ell }^{\mathbf {z}\mathbf {w}}}{p_{k.}^{\mathbf {z}} p_{.\ell }^{\mathbf {w}}}}\\= & {} \sum _{i,j} p_{ij} \log \frac{p_{ij}}{q_{ij}^{\mathbf {z}\mathbf {w}}} =\text {KL}(P_{IJ}||Q_{IJ}^{\mathbf {z}\mathbf {w}}) \ge 0. \end{aligned}$$ -
The two inequalities (f) can easily be deduced from the previous relation.
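Both equalities can be checked numerically in the same way. The sketch below (NumPy assumed; the table and partitions are arbitrary illustrations) verifies (d), \(\mathcal {I}(Q_{IJ}^{\mathbf {z}\mathbf {w}})=\mathcal {I}(P_{KL}^{\mathbf {z}\mathbf {w}})\), and (e), \(\mathcal {I}(P_{IJ})-\mathcal {I}(Q_{IJ}^{\mathbf {z}\mathbf {w}})=\text {KL}(P_{IJ}||Q_{IJ}^{\mathbf {z}\mathbf {w}})\):

```python
import numpy as np

rng = np.random.default_rng(2)
P = rng.random((4, 5)); P /= P.sum()
z = np.array([0, 0, 1, 1]); w = np.array([0, 1, 1, 2, 2])
Z = np.eye(2)[z]; W = np.eye(3)[w]

pi, pj = P.sum(1), P.sum(0)
Pkl = Z.T @ P @ W                            # aggregated table p_kl^{zw}
ratio = Pkl / np.outer(Z.T @ pi, W.T @ pj)
Q = np.outer(pi, pj) * ratio[z][:, w]

def mi(T):
    """Mutual information of a joint table T."""
    ind = np.outer(T.sum(1), T.sum(0))
    return np.sum(T * np.log(T / ind))

assert np.isclose(mi(Q), mi(Pkl))                   # equation (d)
kl = np.sum(P * np.log(P / Q))
assert np.isclose(mi(P) - mi(Q), kl) and kl >= 0    # equation (e)
```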
Appendix F: Properties of \(R_{IJ}^{\mathbf {z}\mathbf {w}\varvec{\delta }}\)
Proof
and, symmetrically, \(r_{.j}^{\mathbf {z}\mathbf {w}\varvec{\delta }}=p_{.j}.\) \(\square \)
Appendix G: Proof of Eq. (7)
and since \(z_{ik}w_{j\ell } \in \{0,1\}\), we obtain \(\widetilde{W}_{\Phi ^2}(\mathbf {z},\mathbf {w},\varvec{\delta }) =\sum _{i,j,k,\ell } z_{ik} w_{j\ell } \,p_{i.}p_{.j} \left( \frac{p_{ij}}{p_{i.}p_{.j}}-\delta _{k\ell }\right) ^2\)
Appendix H: Proof of Eq. (8)
Appendix I: Proof of Proposition 4
Using Eq. (7), the problem can be formulated as \({{\mathrm{argmin}}}_{\delta _{k\ell }} F(\delta _{k\ell }) \quad \forall k,\ell \) where \(F(\delta _{k\ell })=\sum _{i,j} z_{ik} w_{j\ell } \,p_{i.}p_{.j} \left( \frac{p_{ij}}{p_{i.}p_{.j}}-\delta _{k\ell }\right) ^2\)
is a quadratic function of \(\delta _{k\ell }\) which attains its minimum at \(\delta _{k\ell }=-\frac{- 2 p_{k\ell }^{\mathbf {z}\mathbf {w}}}{2 (p_{k.}^{\mathbf {z}} p_{.\ell }^{\mathbf {w}})} =\frac{p_{k\ell }^{\mathbf {z}\mathbf {w}}}{p_{k.}^{\mathbf {z}}p_{.\ell }^{\mathbf {w}}}.\)
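The closed form for \(\delta _{k\ell }\) can be sanity-checked by comparing \(F\) at the candidate minimizer and at nearby values. A sketch (NumPy assumed; the table and the chosen block, i.e. one pair of row and column clusters, are arbitrary illustrations):

```python
import numpy as np

rng = np.random.default_rng(3)
P = rng.random((3, 4)); P /= P.sum()
pi, pj = P.sum(1), P.sum(0)
rows, cols = [0, 1], [1, 2]                 # cells of one block (k, l): illustrative
B = np.ix_(rows, cols)
Wt = np.outer(pi[rows], pj[cols])           # weights p_i. p_.j on the block

def F(delta):
    """Block contribution to the criterion of Eq. (7)."""
    return np.sum(Wt * (P[B] / Wt - delta) ** 2)

# Candidate minimizer: delta* = p_kl / (p_k. p_.l)
delta_star = P[B].sum() / (pi[rows].sum() * pj[cols].sum())
for eps in (-0.05, 0.05):
    assert F(delta_star) < F(delta_star + eps)  # quadratic has its minimum there
```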
Appendix J: Proof of Proposition 5
The problem can be formulated as \({{\mathrm{argmax}}}_{\delta _{k\ell }} \sum _{k,\ell } p_{k\ell }^{\mathbf {z}\mathbf {w}} \log \delta _{k\ell } \ \text {with} \ \sum _{k,\ell }p_{k.}^{\mathbf {z}}p_{.\ell }^{\mathbf {w}} \delta _{k\ell }=1.\) We solve it using the method of Lagrange multipliers: introducing a multiplier \(\lambda \), we study the Lagrangian \(F(\varvec{\delta },\lambda )=\sum _{k,\ell } p_{k\ell }^{\mathbf {z}\mathbf {w}}\log \delta _{k\ell } -\lambda (\sum _{k,\ell }p_{k.}^{\mathbf {z}}p_{.\ell }^{\mathbf {w}}\delta _{k\ell }-1).\) Then, for all \(k,\ell \), \( \frac{\partial F}{\partial \delta _{k\ell }}=0 \Rightarrow \frac{p_{k\ell }^{\mathbf {z}\mathbf {w}}}{\delta _{k\ell }} - \lambda p_{k.}^{\mathbf {z}}p_{.\ell }^{\mathbf {w}}=0 \Rightarrow \delta _{k\ell }=\frac{p_{k\ell }^{\mathbf {z}\mathbf {w}}}{\lambda p_{k.}^{\mathbf {z}} p_{.\ell }^{\mathbf {w}}}. \) The constraint \(\sum _{k,\ell } p_{k.}^{\mathbf {z}}p_{.\ell }^{\mathbf {w}} \delta _{k\ell }=1\) then yields \(\lambda =\sum _{k,\ell }p_{k\ell }^{\mathbf {z}\mathbf {w}}=1\), and therefore \(\delta _{k\ell }=\frac{p_{k\ell }^{\mathbf {z}\mathbf {w}}}{p_{k.}^{\mathbf {z}}p_{.\ell }^{\mathbf {w}}}.\)
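The Lagrangian solution can be checked against feasible perturbations: any direction \(d\) with \(\sum _{k,\ell } p_{k.}^{\mathbf {z}} p_{.\ell }^{\mathbf {w}} d_{k\ell }=0\) keeps the constraint satisfied and, by concavity of the objective, should not increase it. A sketch with an illustrative block table (NumPy assumed):

```python
import numpy as np

Pkl = np.array([[0.20, 0.10, 0.10],
                [0.15, 0.25, 0.20]])        # illustrative p_kl^{zw}, sums to 1
pk, pl = Pkl.sum(1), Pkl.sum(0)
Wt = np.outer(pk, pl)                       # weights p_k. p_.l

delta_star = Pkl / Wt                       # candidate maximizer
assert np.isclose(np.sum(Wt * delta_star), 1.0)   # constraint holds

obj = lambda d: np.sum(Pkl * np.log(d))
d = np.array([[1.0, -1.0, 0.0], [0.0, 1.0, -1.0]])
d -= Wt * np.sum(Wt * d) / np.sum(Wt ** 2)  # project onto the constraint surface
for eps in (-0.01, 0.01):
    # feasible perturbations strictly decrease the concave objective
    assert obj(delta_star + eps * d) < obj(delta_star)
```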
Appendix K: VEM algorithm
Knowing that \(y_{k\ell }^{\widetilde{\mathbf {z}}\widetilde{\mathbf {w}}}=\sum _{i,j} \widetilde{z}_{ik} \widetilde{w}_{j\ell } \, x_{ij}\), \(y_{k.}^{\widetilde{\mathbf {z}}}=\sum _{\ell } y_{k\ell }^{\widetilde{\mathbf {z}}\widetilde{\mathbf {w}}}\) and \(y_{.\ell }^{\widetilde{\mathbf {w}}}=\sum _k y_{k\ell }^{\widetilde{\mathbf {z}}\widetilde{\mathbf {w}}}\), VEM alternates the following steps
-
Update of \(\varvec{\theta }\): \(\pi _k=\frac{\widetilde{z}_{.k}}{n}\), \(\rho _\ell =\frac{\widetilde{w}_{.\ell }}{d}\) and \(\gamma _{k\ell }=\frac{y_{k\ell }^{\widetilde{\mathbf {z}}\widetilde{\mathbf {w}}}}{y_{k.}^{\widetilde{\mathbf {z}}}y_{.\ell }^{\widetilde{\mathbf {w}}}}\)
-
Update of \(\widetilde{\mathbf {z}}\): \(\tilde{z}_{ik} \propto \pi _k \exp (\sum _\ell y_{i\ell }^{\widetilde{\mathbf {w}}} \log \gamma _{k\ell })\), where \(y_{i\ell }^{\widetilde{\mathbf {w}}}=\sum _j \widetilde{w}_{j\ell }\, x_{ij}\)
-
Update of \(\widetilde{\mathbf {w}}\): \(\tilde{w}_{j\ell } \propto \rho _\ell \exp (\sum _k y_{kj}^{\widetilde{\mathbf {z}}} \log \gamma _{k\ell })\), where \(y_{kj}^{\widetilde{\mathbf {z}}}=\sum _i \widetilde{z}_{ik}\, x_{ij}\)
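The alternating updates above can be sketched as follows (a minimal NumPy implementation, assuming soft membership matrices \(\widetilde{\mathbf {z}}\) of size \(n\times K\) and \(\widetilde{\mathbf {w}}\) of size \(d\times L\); the function and variable names are our own, and only a log-sum-exp shift is included as a numerical safeguard):

```python
import numpy as np

def vem_step(x, zt, wt):
    """One VEM iteration for the Poisson latent block model (sketch).

    x  : (n, d) contingency table
    zt : (n, K) soft row memberships, rows sum to 1
    wt : (d, L) soft column memberships, rows sum to 1
    """
    # Update of theta
    pik, rho = zt.mean(axis=0), wt.mean(axis=0)         # pi_k, rho_l
    Y = zt.T @ x @ wt                                   # y_kl
    gamma = Y / np.outer(Y.sum(axis=1), Y.sum(axis=0))  # y_kl / (y_k. y_.l)
    # Update of zt: z_ik proportional to pi_k exp(sum_l y_il log gamma_kl)
    log_z = np.log(pik) + (x @ wt) @ np.log(gamma).T    # y_il = sum_j w_jl x_ij
    zt = np.exp(log_z - log_z.max(axis=1, keepdims=True))
    zt /= zt.sum(axis=1, keepdims=True)
    # Update of wt, symmetrically, reusing the zt computed just above
    log_w = np.log(rho) + (zt.T @ x).T @ np.log(gamma)  # y_kj = sum_i z_ik x_ij
    wt = np.exp(log_w - log_w.max(axis=1, keepdims=True))
    wt /= wt.sum(axis=1, keepdims=True)
    return pik, rho, gamma, zt, wt

rng = np.random.default_rng(0)
x = rng.poisson(2.0, size=(6, 5)).astype(float) + 1.0   # strictly positive toy table
zt = rng.dirichlet(np.ones(2), size=6)
wt = rng.dirichlet(np.ones(3), size=5)
pik, rho, gamma, zt, wt = vem_step(x, zt, wt)
```

Note that the column update reuses the freshly computed row memberships, matching the \(\widetilde{\mathbf {z}}'\) appearing in the proof below.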
Proof
-
Update of \(\varvec{\theta }\):
-
Equations (16) and (17) lead to \({{\mathrm{argmax}}}_{\varvec{\pi }} \sum _k \widetilde{z}_{.k} \log \pi _k\) and then to \(\pi _k=\frac{\widetilde{z}_{.k}}{n}\) \(\forall k.\)
-
Similarly, we have \(\rho _\ell =\frac{\widetilde{w}_{.\ell }}{d}\qquad \forall \ell .\)
-
For \(\gamma _{k\ell }\), we have \(\forall k,\ell \),
$$\begin{aligned} \hat{\gamma }_{k\ell }= & {} {{\mathrm{argmax}}}_{\gamma _{k\ell }} \sum _{i,j} \widetilde{z}_{ik} \widetilde{w}_{j\ell }(x_{ij} \log \gamma _{k\ell }-x_{i.} x_{.j} \gamma _{k\ell })\\= & {} {{\mathrm{argmax}}}_{\gamma _{k\ell }} \left( y_{k\ell }^{\widetilde{\mathbf {z}}\widetilde{\mathbf {w}}} \log \gamma _{k\ell }-y_{k.}^{\widetilde{\mathbf {z}}}y_{.\ell }^{\widetilde{\mathbf {w}}} \gamma _{k\ell }\right) =\frac{y_{k\ell }^{\widetilde{\mathbf {z}}\widetilde{\mathbf {w}}}}{y_{k.}^{\widetilde{\mathbf {z}}}y_{.\ell }^{\widetilde{\mathbf {w}}}}. \end{aligned}$$ -
-
Update of \(\widetilde{\mathbf {z}}\): Eqs. (16) and (17) lead, for all i, to
$$\begin{aligned} {{\mathrm{argmax}}}_{\widetilde{z}_{ik}} \sum _k \left( \widetilde{z}_{ik} \log \pi _k + \widetilde{z}_{ik}\sum _{j,\ell } \widetilde{w}_{j\ell } (x_{ij} \log \gamma _{k\ell } -x_{i.}x_{.j} \gamma _{k\ell }) - \widetilde{z}_{ik} \log \widetilde{z}_{ik}\right) \end{aligned}$$\(\forall i,k\) under the constraint \(\sum _k \widetilde{z}_{ik}=1\). This takes the following form
$$\begin{aligned} {{\mathrm{argmax}}}_{\widetilde{z}_{ik}}\sum _k\left( \widetilde{z}_{ik} \log \pi _k + \widetilde{z}_{ik} \sum _\ell (y_{i\ell }^{\widetilde{\mathbf {w}}} \log \gamma _{k\ell } - x_{i.} y_{.\ell }^{\widetilde{\mathbf {w}}} \gamma _{k\ell }) - \widetilde{z}_{ik} \log \widetilde{z}_{ik}\right) \end{aligned}$$where \(y_{i\ell }^{\widetilde{\mathbf {w}}}=\sum _j \widetilde{w}_{j\ell }\, x_{ij}\), and since \(\gamma _{k\ell }=\frac{y_{k\ell }^{\widetilde{\mathbf {z}}'\widetilde{\mathbf {w}}}}{y_{k.}^{\widetilde{\mathbf {z}}'}y_{.\ell }^{\widetilde{\mathbf {w}}}}\) where \(\widetilde{\mathbf {z}}'\) denotes the vector of membership values computed above, we obtain
$$\begin{aligned} \sum _\ell x_{i.} y_{.\ell }^{\widetilde{\mathbf {w}}} \gamma _{k\ell }= \sum _\ell x_{i.} y_{.\ell }^{\widetilde{\mathbf {w}}} \frac{y_{k\ell }^{\widetilde{\mathbf {z}}'\widetilde{\mathbf {w}}}}{y_{k.}^{\widetilde{\mathbf {z}}'}y_{.\ell }^{\widetilde{\mathbf {w}}}}=x_{i.} \frac{\sum _\ell y_{k\ell }^{\widetilde{\mathbf {z}}'\widetilde{\mathbf {w}}}}{y_{k.}^{\widetilde{\mathbf {z}}'}}=x_{i.} \end{aligned}$$which does not depend on k and then
$$\begin{aligned} {{\mathrm{argmax}}}_{\widetilde{z}_{ik}}\sum _k\left( \widetilde{z}_{ik} \log \pi _k + \widetilde{z}_{ik} \sum _\ell y_{i\ell }^{\widetilde{\mathbf {w}}} \log \gamma _{k\ell } - \widetilde{z}_{ik} \log \widetilde{z}_{ik}\right) \end{aligned}$$under the constraint \(\sum _k \widetilde{z}_{ik}=1\). Using a Lagrange multiplier, we obtain
$$\begin{aligned} \tilde{z}_{ik} =\frac{\pi _k \exp (\sum _\ell y_{i\ell }^{\widetilde{\mathbf {w}}} \log \gamma _{k\ell })}{\sum _{k'} \pi _{k'} \exp (\sum _\ell y_{i\ell }^{\widetilde{\mathbf {w}}} \log \gamma _{k'\ell })}\propto \pi _k \exp \left( \sum _\ell y_{i\ell }^{\widetilde{\mathbf {w}}} \log \gamma _{k\ell }\right) \end{aligned}$$ -
The update of \(\widetilde{\mathbf {w}}\) is proven in a similar way.
\(\square \)
Cite this article
Govaert, G., Nadif, M. Mutual information, phi-squared and model-based co-clustering for contingency tables. Adv Data Anal Classif 12, 455–488 (2018). https://doi.org/10.1007/s11634-016-0274-6