Abstract
Many of the datasets encountered in statistics are two-dimensional in nature and can be represented by a matrix. Classical clustering procedures seek to construct separately an optimal partition of rows or, sometimes, of columns. In contrast, co-clustering methods cluster the rows and the columns simultaneously and organize the data into homogeneous blocks (after suitable permutations). Methods of this kind have practical importance in a wide variety of applications such as document clustering, where data are typically organized in two-way contingency tables. Our goal is to offer coherent frameworks for understanding some existing criteria and algorithms for co-clustering contingency tables, and to propose new ones. We look at two different frameworks for the problem of co-clustering. The first involves minimizing an objective function based on measures of association and in particular on phi-squared and mutual information. The second uses a model-based co-clustering approach, and we consider two models: the block model and the latent block model. We establish connections between different approaches, criteria and algorithms, and we highlight a number of implicit assumptions in some commonly used algorithms. Our contribution is illustrated by numerical experiments on simulated and real-case datasets that show the relevance of the presented methods in the document clustering field.
Acknowledgements
We thank the referees and editors for their valuable suggestions. We acknowledge funding support from AAP Sorbonne Paris Cité.
Appendices
Appendix A: Proof of Proposition 1
Denoting \(e_{ij}=p_{ij}-p_{i.} p_{.j}\) \(\forall i,j\) and using the expansion \(\log (1+x)=x-x^2/2 + O(x^3)\), we can write \(\log (1+\frac{e_{ij}}{p_{i.}p_{.j}})=\frac{e_{ij}}{p_{i.}p_{.j}} - \frac{1}{2} \left( \frac{e_{ij}}{p_{i.}p_{.j}}\right) ^2 + O(e_{ij}^3)\) for \(e_{ij} \rightarrow 0\), and
this yields
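The expansion implies that, near independence, the mutual information is close to half the phi-squared coefficient. As a numerical sanity check (not part of the original proof; NumPy assumed, and the 2×2 table and its margin-preserving perturbation are illustrative choices):

```python
import numpy as np

# Independent base table plus a small perturbation e_ij with zero row/column sums
p_i = np.array([0.3, 0.7])
p_j = np.array([0.4, 0.6])
base = np.outer(p_i, p_j)
e = 1e-3 * np.array([[1.0, -1.0], [-1.0, 1.0]])  # margins unchanged
P = base + e

pi, pj = P.sum(1), P.sum(0)
ind = np.outer(pi, pj)                 # product of margins p_i. p_.j
I = np.sum(P * np.log(P / ind))        # mutual information
phi2 = np.sum((P - ind) ** 2 / ind)    # phi-squared coefficient

# For small e_ij, I(P) and Phi^2(P)/2 agree up to O(e^3)
assert abs(I - phi2 / 2) < 1e-6
```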
Appendix B: Properties of \(P_{KL}^{\mathbf {z}\mathbf {w}}\)
Proof
We have
\(\square \)
Appendix C: Properties of \(Q_{IJ}^{\mathbf {z}\mathbf {w}}\)
Proof
and, symmetrically, \(q_{.j}^{\mathbf {w}}=p_{.j}.\) \(\square \)
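The margin-preservation property can be checked numerically. In this sketch (NumPy assumed; the table and the partitions are arbitrary illustrations), \(q_{ij}^{\mathbf {z}\mathbf {w}}\) is built, as in Appendix E, as \(p_{i.}p_{.j}\,p_{k\ell }^{\mathbf {z}\mathbf {w}}/(p_{k.}^{\mathbf {z}}p_{.\ell }^{\mathbf {w}})\) for \(i\) in row cluster \(k\) and \(j\) in column cluster \(\ell \):

```python
import numpy as np

rng = np.random.default_rng(0)
P = rng.random((4, 5)); P /= P.sum()        # joint table p_ij
z = np.array([0, 0, 1, 1])                  # row partition (illustrative)
w = np.array([0, 1, 1, 2, 2])               # column partition (illustrative)
Z = np.eye(2)[z]; W = np.eye(3)[w]          # indicator matrices

pi, pj = P.sum(1), P.sum(0)
Pkl = Z.T @ P @ W                           # aggregated table p_kl^{zw}
ratio = Pkl / np.outer(Z.T @ pi, W.T @ pj)  # p_kl / (p_k. p_.l)
Q = np.outer(pi, pj) * ratio[z][:, w]       # q_ij^{zw}

# Q is a distribution with the same margins as P
assert np.isclose(Q.sum(), 1.0)
assert np.allclose(Q.sum(1), pi) and np.allclose(Q.sum(0), pj)
```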
Appendix D: Proof of Proposition 2
Lemma 1
Proof
\(\square \)
Proof of Proposition 2
-
The first equation (a) can easily be deduced from Lemma 1.
-
Second equation (b):
$$\begin{aligned} \Phi ^2(P_{IJ})-\Phi ^2(Q_{IJ}^{\mathbf {z}\mathbf {w}})= & {} \left( \sum _{i,j} \frac{p_{ij}^2}{p_{i.}p_{.j}} -1\right) - \left( \sum _{i,j} \frac{(q_{ij}^{\mathbf {z}\mathbf {w}})^2}{q_{i.}^{\mathbf {z}}q_{.j}^{\mathbf {w}}} -1\right) \\= & {} \sum _{i,j} \frac{p_{ij}^2}{p_{i.}p_{.j}} - \sum _{i,j} \frac{(q_{ij}^{\mathbf {z}\mathbf {w}})^2}{p_{i.}p_{.j}}. \end{aligned}$$Using Lemma 1, this relation can be written
$$\begin{aligned} \sum _{i,j} \frac{p_{ij}^2}{p_{i.}p_{.j}} - \sum _{i,j} \frac{p_{ij}q_{ij}^{\mathbf {z}\mathbf {w}}}{p_{i.}p_{.j}} = \sum _{i,j} p_{ij} \left( \frac{p_{ij}}{p_{i.}p_{.j}} - \frac{q_{ij}^{\mathbf {z}\mathbf {w}}}{p_{i.}p_{.j}}\right) \end{aligned}$$and
$$\begin{aligned}&\sum _{i,j} \frac{p_{ij}^2}{p_{i.}p_{.j}} - 2 \sum _{i,j} \frac{p_{ij}q_{ij}^{\mathbf {z}\mathbf {w}}}{p_{i.}p_{.j}} + \sum _{i,j} \frac{(q_{ij}^{\mathbf {z}\mathbf {w}})^2}{p_{i.}p_{.j}}=\sum _{i,j} \frac{(p_{ij}-q_{ij}^{\mathbf {z}\mathbf {w}})^2}{p_{i.}p_{.j}}\\&\quad =\chi ^2(P_{IJ}||Q_{IJ}^{\mathbf {z}\mathbf {w}})\ge 0. \end{aligned}$$ -
The two inequalities (c) can easily be deduced from the previous relation.
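Equation (b) can be verified numerically. The sketch below (NumPy assumed; the table and the partitions \(\mathbf {z},\mathbf {w}\) are arbitrary illustrations) checks that \(\Phi ^2(P_{IJ})-\Phi ^2(Q_{IJ}^{\mathbf {z}\mathbf {w}})=\chi ^2(P_{IJ}||Q_{IJ}^{\mathbf {z}\mathbf {w}})\):

```python
import numpy as np

rng = np.random.default_rng(1)
P = rng.random((4, 5)); P /= P.sum()
z = np.array([0, 0, 1, 1]); w = np.array([0, 1, 1, 2, 2])
Z = np.eye(2)[z]; W = np.eye(3)[w]

pi, pj = P.sum(1), P.sum(0)
Pkl = Z.T @ P @ W
ratio = Pkl / np.outer(Z.T @ pi, W.T @ pj)
Q = np.outer(pi, pj) * ratio[z][:, w]       # block approximation q_ij^{zw}

def phi2(T):
    """Phi-squared coefficient of a joint table T."""
    ind = np.outer(T.sum(1), T.sum(0))
    return np.sum((T - ind) ** 2 / ind)

chi2 = np.sum((P - Q) ** 2 / np.outer(pi, pj))
assert np.isclose(phi2(P) - phi2(Q), chi2)  # equation (b)
assert chi2 >= 0                            # hence Phi^2(Q) <= Phi^2(P)
```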
Appendix E: Proof of Proposition 3
-
First equation (d):
$$\begin{aligned} \mathcal {I}(Q_{IJ}^{\mathbf {z}\mathbf {w}})= & {} \sum _{i,j} q_{ij}^{\mathbf {z}\mathbf {w}} \log \frac{q_{ij}^{\mathbf {z}\mathbf {w}}}{p_{i.}p_{.j}} =\sum _{i,j}\left( p_{i.}p_{.j}\left( \sum _{k,\ell }z_{ik} w_{j\ell }\frac{p_{k\ell }^{\mathbf {z}\mathbf {w}}}{p_{k.}^{\mathbf {z}}p_{.\ell }^{\mathbf {w}}}\right) \right. \\&\left. \times \log \left( \sum _{k,\ell } z_{ik}w_{j\ell }\frac{p_{k\ell }^{\mathbf {z}\mathbf {w}}}{p_{k.}^{\mathbf {z}}p_{.\ell }^{\mathbf {w}}}\right) \right) \\= & {} \sum _{i,j}\left( p_{i.}p_{.j}\left( \sum _{k,\ell }z_{ik} w_{j\ell }\frac{p_{k\ell }^{\mathbf {z}\mathbf {w}}}{p_{k.}^{\mathbf {z}}p_{.\ell }^{\mathbf {w}}} \log \frac{p_{k\ell }^{\mathbf {z}\mathbf {w}}}{p_{k.}^{\mathbf {z}}p_{.\ell }^{\mathbf {w}}}\right) \right) \\= & {} \sum _{k,\ell } p_{k\ell }^{\mathbf {z}\mathbf {w}} \log \frac{p_{k\ell }^{\mathbf {z}\mathbf {w}}}{p_{k.}^{\mathbf {z}}p_{.\ell }^{\mathbf {w}}}\\= & {} \mathcal {I}(P_{KL}^{\mathbf {z}\mathbf {w}}). \end{aligned}$$ -
Second equation (e):
$$\begin{aligned} \mathcal {I}(P_{IJ})-\mathcal {I}(Q_{IJ}^{\mathbf {z}\mathbf {w}})= & {} \mathcal {I}(P_{IJ})-\mathcal {I}(P_{KL}^{\mathbf {z}\mathbf {w}}) =\sum _{i,j} p_{ij} \log \frac{p_{ij}}{p_{i.}p_{.j}} -\sum _{k,\ell } p_{k\ell }^{\mathbf {z}\mathbf {w}} \log \frac{p_{k\ell }^{\mathbf {z}\mathbf {w}}}{p_{k.}^{\mathbf {z}}p_{.\ell }^{\mathbf {w}}}\\= & {} \sum _{i,j,k,\ell } z_{ik}w_{j\ell } p_{ij} \log \frac{p_{ij}}{p_{i.}p_{.j}} -\sum _{i,j,k,\ell }(z_{ik}w_{j\ell } p_{ij}) \log \frac{p_{k\ell }^{\mathbf {z}\mathbf {w}}}{p_{k.}^{\mathbf {z}}p_{.\ell }^{\mathbf {w}}}\\= & {} \sum _{i,j,k,\ell } z_{ik}w_{j\ell } p_{ij} \log \frac{p_{ij}}{p_{i.}p_{.j}} \frac{p_{k.}^{\mathbf {z}} p_{.\ell }^{\mathbf {w}}}{p_{k\ell }^{\mathbf {z}\mathbf {w}}}\\= & {} \sum _{i,j,k,\ell } z_{ik}w_{j\ell } p_{ij} \log \frac{p_{ij}}{p_{i.}p_{.j} \frac{p_{k\ell }^{\mathbf {z}\mathbf {w}}}{p_{k.}^{\mathbf {z}} p_{.\ell }^{\mathbf {w}}}}\\= & {} \sum _{i,j} p_{ij} \log \frac{p_{ij}}{p_{i.}p_{.j}\sum _{k,\ell } z_{ik} w_{j\ell } \frac{p_{k\ell }^{\mathbf {z}\mathbf {w}}}{p_{k.}^{\mathbf {z}} p_{.\ell }^{\mathbf {w}}}}\\= & {} \sum _{i,j} p_{ij} \log \frac{p_{ij}}{q_{ij}^{\mathbf {z}\mathbf {w}}} =\text {KL}(P_{IJ}||Q_{IJ}^{\mathbf {z}\mathbf {w}}) \ge 0. \end{aligned}$$ -
The two inequalities (f) can easily be deduced from the previous relation.
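Both equalities can be checked numerically in the same way. The sketch below (NumPy assumed; the table and partitions are arbitrary illustrations) verifies (d), \(\mathcal {I}(Q_{IJ}^{\mathbf {z}\mathbf {w}})=\mathcal {I}(P_{KL}^{\mathbf {z}\mathbf {w}})\), and (e), \(\mathcal {I}(P_{IJ})-\mathcal {I}(Q_{IJ}^{\mathbf {z}\mathbf {w}})=\text {KL}(P_{IJ}||Q_{IJ}^{\mathbf {z}\mathbf {w}})\):

```python
import numpy as np

rng = np.random.default_rng(2)
P = rng.random((4, 5)); P /= P.sum()
z = np.array([0, 0, 1, 1]); w = np.array([0, 1, 1, 2, 2])
Z = np.eye(2)[z]; W = np.eye(3)[w]

pi, pj = P.sum(1), P.sum(0)
Pkl = Z.T @ P @ W                            # aggregated table p_kl^{zw}
ratio = Pkl / np.outer(Z.T @ pi, W.T @ pj)
Q = np.outer(pi, pj) * ratio[z][:, w]

def mi(T):
    """Mutual information of a joint table T."""
    ind = np.outer(T.sum(1), T.sum(0))
    return np.sum(T * np.log(T / ind))

assert np.isclose(mi(Q), mi(Pkl))                   # equation (d)
kl = np.sum(P * np.log(P / Q))
assert np.isclose(mi(P) - mi(Q), kl) and kl >= 0    # equation (e)
```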
Appendix F: Properties of \(R_{IJ}^{\mathbf {z}\mathbf {w}\varvec{\delta }}\)
Proof
and, symmetrically, \(r_{.j}^{\mathbf {z}\mathbf {w}\varvec{\delta }}=p_{.j}.\) \(\square \)
Appendix G: Proof of Eq. (7)
and since \(z_{ik}w_{j\ell } \in \{0,1\}\), we obtain \(\widetilde{W}_{\Phi ^2}(\mathbf {z},\mathbf {w},\varvec{\delta }) =\sum _{i,j,k,\ell } z_{ik} w_{j\ell } \,p_{i.}p_{.j} \left( \frac{p_{ij}}{p_{i.}p_{.j}}-\delta _{k\ell }\right) ^2\)
Appendix H: Proof of Eq. (8)
Appendix I: Proof of Proposition 4
Using Eq. (7), the problem can be formulated as \({{\mathrm{argmin}}}_{\delta _{k\ell }} F(\delta _{k\ell }) \quad \forall k,\ell \) where \(F(\delta _{k\ell })=\sum _{i,j} z_{ik} w_{j\ell } \,p_{i.}p_{.j} \left( \frac{p_{ij}}{p_{i.}p_{.j}}-\delta _{k\ell }\right) ^2\)
is a quadratic function of \(\delta _{k\ell }\) which attains its minimum at \(\delta _{k\ell }=-\frac{- 2 p_{k\ell }^{\mathbf {z}\mathbf {w}}}{2 (p_{k.}^{\mathbf {z}} p_{.\ell }^{\mathbf {w}})} =\frac{p_{k\ell }^{\mathbf {z}\mathbf {w}}}{p_{k.}^{\mathbf {z}}p_{.\ell }^{\mathbf {w}}}.\)
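The closed form for \(\delta _{k\ell }\) can be sanity-checked by comparing \(F\) at the candidate minimizer and at nearby values. A sketch (NumPy assumed; the table and the chosen block, i.e. one pair of row and column clusters, are arbitrary illustrations):

```python
import numpy as np

rng = np.random.default_rng(3)
P = rng.random((3, 4)); P /= P.sum()
pi, pj = P.sum(1), P.sum(0)
rows, cols = [0, 1], [1, 2]                 # cells of one block (k, l): illustrative
B = np.ix_(rows, cols)
Wt = np.outer(pi[rows], pj[cols])           # weights p_i. p_.j on the block

def F(delta):
    """Block contribution to the criterion of Eq. (7)."""
    return np.sum(Wt * (P[B] / Wt - delta) ** 2)

# Candidate minimizer: delta* = p_kl / (p_k. p_.l)
delta_star = P[B].sum() / (pi[rows].sum() * pj[cols].sum())
for eps in (-0.05, 0.05):
    assert F(delta_star) < F(delta_star + eps)  # quadratic has its minimum there
```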
Appendix J: Proof of Proposition 5
The problem can be formulated as \({{\mathrm{argmax}}}_{\delta _{k\ell }} \sum _{k,\ell } p_{k\ell }^{\mathbf {z}\mathbf {w}} \log \delta _{k\ell } \ \text {with} \ \sum _{k,\ell }p_{k.}^{\mathbf {z}}p_{.\ell }^{\mathbf {w}} \delta _{k\ell }=1.\) We solve it using the method of Lagrange multipliers: introducing a multiplier \(\lambda \), we study the Lagrangian \(F(\varvec{\delta },\lambda )=\sum _{k,\ell } p_{k\ell }^{\mathbf {z}\mathbf {w}}\log \delta _{k\ell } -\lambda (\sum _{k,\ell }p_{k.}^{\mathbf {z}}p_{.\ell }^{\mathbf {w}}\delta _{k\ell }-1).\) Then, for all \(k,\ell \), \( \frac{\partial F}{\partial \delta _{k\ell }}=0 \Rightarrow \frac{p_{k\ell }^{\mathbf {z}\mathbf {w}}}{\delta _{k\ell }} - \lambda p_{k.}^{\mathbf {z}}p_{.\ell }^{\mathbf {w}}=0 \Rightarrow \delta _{k\ell }=\frac{p_{k\ell }^{\mathbf {z}\mathbf {w}}}{\lambda p_{k.}^{\mathbf {z}} p_{.\ell }^{\mathbf {w}}}. \) The constraint \(\sum _{k,\ell } p_{k.}^{\mathbf {z}}p_{.\ell }^{\mathbf {w}} \delta _{k\ell }=1\) then yields \(\lambda =\sum _{k,\ell }p_{k\ell }^{\mathbf {z}\mathbf {w}}=1\), and therefore \(\delta _{k\ell }=\frac{p_{k\ell }^{\mathbf {z}\mathbf {w}}}{p_{k.}^{\mathbf {z}}p_{.\ell }^{\mathbf {w}}}.\)
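The Lagrangian solution can be checked against feasible perturbations: any direction \(d\) with \(\sum _{k,\ell } p_{k.}^{\mathbf {z}} p_{.\ell }^{\mathbf {w}} d_{k\ell }=0\) keeps the constraint satisfied and, by concavity of the objective, should not increase it. A sketch with an illustrative block table (NumPy assumed):

```python
import numpy as np

Pkl = np.array([[0.20, 0.10, 0.10],
                [0.15, 0.25, 0.20]])        # illustrative p_kl^{zw}, sums to 1
pk, pl = Pkl.sum(1), Pkl.sum(0)
Wt = np.outer(pk, pl)                       # weights p_k. p_.l

delta_star = Pkl / Wt                       # candidate maximizer
assert np.isclose(np.sum(Wt * delta_star), 1.0)   # constraint holds

obj = lambda d: np.sum(Pkl * np.log(d))
d = np.array([[1.0, -1.0, 0.0], [0.0, 1.0, -1.0]])
d -= Wt * np.sum(Wt * d) / np.sum(Wt ** 2)  # project onto the constraint surface
for eps in (-0.01, 0.01):
    # feasible perturbations strictly decrease the concave objective
    assert obj(delta_star + eps * d) < obj(delta_star)
```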
Appendix K: VEM algorithm
Knowing that \(y_{k\ell }^{\widetilde{\mathbf {z}}\widetilde{\mathbf {w}}}=\sum _{i,j} \widetilde{z}_{ik} \widetilde{w}_{j\ell } \, x_{ij}\), \(y_{k.}^{\widetilde{\mathbf {z}}}=\sum _{\ell } y_{k\ell }^{\widetilde{\mathbf {z}}\widetilde{\mathbf {w}}}\) and \(y_{.\ell }^{\widetilde{\mathbf {w}}}=\sum _k y_{k\ell }^{\widetilde{\mathbf {z}}\widetilde{\mathbf {w}}}\), VEM alternates the following steps
-
Update of \(\varvec{\theta }\): \(\pi _k=\frac{\widetilde{z}_{.k}}{n}\), \(\rho _\ell =\frac{\widetilde{w}_{.\ell }}{d}\) and \(\gamma _{k\ell }=\frac{y_{k\ell }^{\widetilde{\mathbf {z}}\widetilde{\mathbf {w}}}}{y_{k.}^{\widetilde{\mathbf {z}}}y_{.\ell }^{\widetilde{\mathbf {w}}}}\)
-
Update of \(\widetilde{\mathbf {z}}\): \(\tilde{z}_{ik} \propto \pi _k \exp (\sum _\ell y_{i\ell }^{\widetilde{\mathbf {w}}} \log \gamma _{k\ell })\), where \(y_{i\ell }^{\widetilde{\mathbf {w}}}=\sum _j \widetilde{w}_{j\ell }\, x_{ij}\)
-
Update of \(\widetilde{\mathbf {w}}\): \(\tilde{w}_{j\ell } \propto \rho _\ell \exp (\sum _k y_{kj}^{\widetilde{\mathbf {z}}} \log \gamma _{k\ell })\), where \(y_{kj}^{\widetilde{\mathbf {z}}}=\sum _i \widetilde{z}_{ik}\, x_{ij}\)
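The alternating updates above can be sketched as follows (a minimal NumPy implementation, assuming soft membership matrices \(\widetilde{\mathbf {z}}\) of size \(n\times K\) and \(\widetilde{\mathbf {w}}\) of size \(d\times L\); the function and variable names are our own, and only a log-sum-exp shift is included as a numerical safeguard):

```python
import numpy as np

def vem_step(x, zt, wt):
    """One VEM iteration for the Poisson latent block model (sketch).

    x  : (n, d) contingency table
    zt : (n, K) soft row memberships, rows sum to 1
    wt : (d, L) soft column memberships, rows sum to 1
    """
    # Update of theta
    pik, rho = zt.mean(axis=0), wt.mean(axis=0)         # pi_k, rho_l
    Y = zt.T @ x @ wt                                   # y_kl
    gamma = Y / np.outer(Y.sum(axis=1), Y.sum(axis=0))  # y_kl / (y_k. y_.l)
    # Update of zt: z_ik proportional to pi_k exp(sum_l y_il log gamma_kl)
    log_z = np.log(pik) + (x @ wt) @ np.log(gamma).T    # y_il = sum_j w_jl x_ij
    zt = np.exp(log_z - log_z.max(axis=1, keepdims=True))
    zt /= zt.sum(axis=1, keepdims=True)
    # Update of wt, symmetrically, reusing the zt computed just above
    log_w = np.log(rho) + (zt.T @ x).T @ np.log(gamma)  # y_kj = sum_i z_ik x_ij
    wt = np.exp(log_w - log_w.max(axis=1, keepdims=True))
    wt /= wt.sum(axis=1, keepdims=True)
    return pik, rho, gamma, zt, wt

rng = np.random.default_rng(0)
x = rng.poisson(2.0, size=(6, 5)).astype(float) + 1.0   # strictly positive toy table
zt = rng.dirichlet(np.ones(2), size=6)
wt = rng.dirichlet(np.ones(3), size=5)
pik, rho, gamma, zt, wt = vem_step(x, zt, wt)
```

Note that the column update reuses the freshly computed row memberships, matching the \(\widetilde{\mathbf {z}}'\) appearing in the proof below.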
Proof
-
Update of \(\varvec{\theta }\):
-
Equations (16) and (17) lead to \({{\mathrm{argmax}}}_{\varvec{\pi }} \sum _k \widetilde{z}_{.k} \log \pi _k\) and then to \(\pi _k=\frac{\widetilde{z}_{.k}}{n}\) \(\forall k.\)
-
Similarly, we have \(\rho _\ell =\frac{\widetilde{w}_{.\ell }}{d}\qquad \forall \ell .\)
-
For \(\gamma _{k\ell }\), we have \(\forall k,\ell \),
$$\begin{aligned} \hat{\gamma }_{k\ell }= & {} {{\mathrm{argmax}}}_{\gamma _{k\ell }} \sum _{i,j} \widetilde{z}_{ik} \widetilde{w}_{j\ell }(x_{ij} \log \gamma _{k\ell }-x_{i.} x_{.j} \gamma _{k\ell })\\= & {} {{\mathrm{argmax}}}_{\gamma _{k\ell }} \left( y_{k\ell }^{\widetilde{\mathbf {z}}\widetilde{\mathbf {w}}} \log \gamma _{k\ell }-y_{k.}^{\widetilde{\mathbf {z}}}y_{.\ell }^{\widetilde{\mathbf {w}}} \gamma _{k\ell }\right) =\frac{y_{k\ell }^{\widetilde{\mathbf {z}}\widetilde{\mathbf {w}}}}{y_{k.}^{\widetilde{\mathbf {z}}}y_{.\ell }^{\widetilde{\mathbf {w}}}}. \end{aligned}$$ -
-
Update of \(\widetilde{\mathbf {z}}\): Eqs. (16) and (17) lead, for all i, to
$$\begin{aligned} {{\mathrm{argmax}}}_{\widetilde{z}_{ik}} \sum _k \left( \widetilde{z}_{ik} \log \pi _k + \widetilde{z}_{ik}\sum _{j,\ell } \widetilde{w}_{j\ell } (x_{ij} \log \gamma _{k\ell } -x_{i.}x_{.j} \gamma _{k\ell }) - \widetilde{z}_{ik} \log \widetilde{z}_{ik}\right) \end{aligned}$$\(\forall i,k\) under the constraint \(\sum _k \widetilde{z}_{ik}=1\). This takes the following form
$$\begin{aligned} {{\mathrm{argmax}}}_{\widetilde{z}_{ik}}\sum _k\left( \widetilde{z}_{ik} \log \pi _k + \widetilde{z}_{ik} \sum _\ell (y_{i\ell }^{\widetilde{\mathbf {w}}} \log \gamma _{k\ell } - x_{i.} y_{.\ell }^{\widetilde{\mathbf {w}}} \gamma _{k\ell }) - \widetilde{z}_{ik} \log \widetilde{z}_{ik}\right) \end{aligned}$$where \(y_{i\ell }^{\widetilde{\mathbf {w}}}=\sum _j \widetilde{w}_{j\ell }\, x_{ij}\), and since \(\gamma _{k\ell }=\frac{y_{k\ell }^{\widetilde{\mathbf {z}}'\widetilde{\mathbf {w}}}}{y_{k.}^{\widetilde{\mathbf {z}}'}y_{.\ell }^{\widetilde{\mathbf {w}}}}\) where \(\widetilde{\mathbf {z}}'\) denotes the vector of membership values computed above, we obtain
$$\begin{aligned} \sum _\ell x_{i.} y_{.\ell }^{\widetilde{\mathbf {w}}} \gamma _{k\ell }= \sum _\ell x_{i.} y_{.\ell }^{\widetilde{\mathbf {w}}} \frac{y_{k\ell }^{\widetilde{\mathbf {z}}'\widetilde{\mathbf {w}}}}{y_{k.}^{\widetilde{\mathbf {z}}'}y_{.\ell }^{\widetilde{\mathbf {w}}}}=x_{i.} \frac{\sum _\ell y_{k\ell }^{\widetilde{\mathbf {z}}'\widetilde{\mathbf {w}}}}{y_{k.}^{\widetilde{\mathbf {z}}'}}=x_{i.} \end{aligned}$$which does not depend on k and then
$$\begin{aligned} {{\mathrm{argmax}}}_{\widetilde{z}_{ik}}\sum _k\left( \widetilde{z}_{ik} \log \pi _k + \widetilde{z}_{ik} \sum _\ell y_{i\ell }^{\widetilde{\mathbf {w}}} \log \gamma _{k\ell } - \widetilde{z}_{ik} \log \widetilde{z}_{ik}\right) \end{aligned}$$under the constraint \(\sum _k \widetilde{z}_{ik}=1\). Using a Lagrange multiplier, we obtain
$$\begin{aligned} \tilde{z}_{ik} =\frac{\pi _k \exp (\sum _\ell y_{i\ell }^{\widetilde{\mathbf {w}}} \log \gamma _{k\ell })}{\sum _{k'} \pi _{k'} \exp (\sum _\ell y_{i\ell }^{\widetilde{\mathbf {w}}} \log \gamma _{k'\ell })}\propto \pi _k \exp \left( \sum _\ell y_{i\ell }^{\widetilde{\mathbf {w}}} \log \gamma _{k\ell }\right) \end{aligned}$$ -
The update of \(\widetilde{\mathbf {w}}\) is proven in a similar way.
\(\square \)
Cite this article
Govaert, G., Nadif, M. Mutual information, phi-squared and model-based co-clustering for contingency tables. Adv Data Anal Classif 12, 455–488 (2018). https://doi.org/10.1007/s11634-016-0274-6