Comparison of Linear Modularization Criteria Using the Relational Formalism, an Approach to Easily Identify Resolution Limit

Conde-Céspedes, Patricia; Marcotorchino, Jean-François; Viennet, Emmanuel

doi:10.1007/978-3-319-45763-5_6

Part of the book series: Studies in Computational Intelligence (SCI, volume 665)

Abstract

The modularization of large graphs, or community detection in networks, is usually approached as the optimization of a quality function or criterion, for instance the Newman-Girvan modularity. Other clustering criteria exist, each with its own properties, and they lead to different solutions. In this paper we present six linear modularization criteria in relational notation: the Newman-Girvan modularity, Zahn-Condorcet, Owsiński-Zadrożny, the Deviation to Uniformity index, the Deviation to Indetermination index and the Balanced-Modularity. We use a generic version of the Louvain algorithm to approximate the optimal partition for each criterion on real networks of different sizes. We have found that the resulting partitions differ substantially in their number of clusters. The relational formalism allows us to justify these differences from a theoretical point of view. Moreover, this notation makes it easy to identify the criteria that have a resolution limit (a phenomenon which causes the criterion to fail to identify modules smaller than a given scale). This finding is confirmed on artificial LFR benchmark graphs.

Notes

1.
For more details about Relational Analysis theory see Marcotorchino and Michaud (1979) and Marcotorchino (1984).
2.
There exists a duality between the independence structure and the indetermination structure (Marcotorchino 1984, 1985; Ah-Pine and Marcotorchino 2007).
3.
Although the name of this criterion contains the word balanced, its definition is not related to the property of balance given in Definition 1.
4.
The data was taken from http://snap.stanford.edu/data/com-Amazon.html.
5.
The contribution for the Balanced Modularity will be given later.
6.
This result is a consequence of the rule this criterion relies on: “The rule of absolute majority of Condorcet” in voting theory.
7.
These expressions are deduced from the two following expressions of Balanced Modularity in terms of Newman-Girvan and Deviation to Indetermination criteria:
$$F_{BM}=2F_{NG}+\sum _{i=1}^N \sum _{i'=1}^N \left( \frac{(a_{i.}-d_{av})(a_{.i'}-d_{av})}{2M(1-\delta )}\right) x_{ii'}$$
and
$$F_{BM}=2F_{DI}+\left( 2-\frac{1}{\delta }\right) \sum _{i=1}^N \sum _{i'=1}^N \left( \frac{(a_{i.}-d_{av})(a_{.i'}-d_{av})}{N^2(1-\delta )} \right) x_{ii'}.$$
8.
LFR graphs are benchmark graphs introduced in Lancichinetti et al. (2008) that aim to reproduce as much as possible the structure that reflects the real properties of nodes and communities found in real networks. These artificial graphs have predefined community structure based on the mixing parameter of each node. As stated in Lancichinetti et al. (2008), for each node the mixing parameter is the fraction of its links it shares with the nodes of the network outside its community.
9.
The normalized mutual information (NMI) is a measure of similarity between two partitions. It originated in information theory, where it measures the departure from independence between two random variables. Given a set of objects V and two partitions $P_1$ and $P_2$ defined on V, the mutual information intuitively measures the information that $P_1$ and $P_2$ share. It is normalized between 0 and 1: it equals 0 if the two partitions are independent and 1 if they are identical. Let p and q be the total number of clusters of partitions $P_1$ and $P_2$ respectively. The NMI is calculated as follows:
$$\begin{aligned} NMI(P_1,P_2)=\frac{2I(P_1,P_2)}{H(P_1)+H(P_2)} \end{aligned}$$
where:
- $I(P_1,P_2)=\sum _{u=1}^p \sum _{v=1}^q p_{uv}\ln \left( \frac{p_{uv}}{p_{u.} p_{.v}} \right) $ is the mutual information of partitions $P_1$ and $P_2$. I tells how much we learn about $P_1$ if we know $P_2$ and vice versa. The quantity $p_{uv}=\frac{n_{uv}}{N}$ is the fraction of objects that belong simultaneously to cluster u of partition $P_1$ and to cluster v of partition $P_2$. Analogously, $p_{u.}=\frac{n_{u.}}{N}$ is the fraction of objects that belong to cluster u of partition $P_1$, $p_{.v}=\frac{n_{.v}}{N}$ is the fraction of objects that belong to cluster v of partition $P_2$, and $|V|=N$. In the case $n_{uv}=0$ we assume $\ln \left( \frac{p_{uv}}{p_{u.} p_{.v}} \right) =0$.
- $H(P_1)=-\sum _{u=1}^p p_{u.}\ln p_{u.}$ represents the Shannon entropy of $P_1$ and $H(P_2)=-\sum _{v=1}^q p_{.v}\ln p_{.v}$ represents the Shannon entropy of $P_2$ (see Shannon (1948)).
10.
By small communities we mean communities of 10 to 50 nodes, the same sizes considered by the authors of LFR graphs (see Lancichinetti and Fortunato (2009)).
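
As an illustration of Note 9, the following is a minimal Python sketch of the NMI formula, assuming each partition is given as a list of cluster labels aligned on the same objects. The function name `nmi` and the handling of the degenerate case where both entropies are zero are assumptions of this sketch, not part of the chapter.

```python
import math
from collections import Counter

def nmi(labels_1, labels_2):
    """Normalized mutual information 2*I / (H1 + H2) of two partitions."""
    if len(labels_1) != len(labels_2):
        raise ValueError("both partitions must label the same objects")
    n = len(labels_1)
    n_u = Counter(labels_1)                   # cluster sizes n_{u.}
    n_v = Counter(labels_2)                   # cluster sizes n_{.v}
    n_uv = Counter(zip(labels_1, labels_2))   # joint counts n_{uv}

    # Only non-empty cells appear in n_uv, which matches the convention
    # that terms with n_uv = 0 contribute nothing to the sum.
    # p_uv / (p_u. * p_.v) simplifies to n_uv * N / (n_u. * n_.v).
    mutual_info = sum(
        (c / n) * math.log(c * n / (n_u[u] * n_v[v]))
        for (u, v), c in n_uv.items()
    )
    h1 = -sum((c / n) * math.log(c / n) for c in n_u.values())
    h2 = -sum((c / n) * math.log(c / n) for c in n_v.values())
    if h1 + h2 == 0:
        # Both partitions consist of a single cluster, so the ratio is 0/0.
        # Returning 1.0 is a convention chosen for this sketch only.
        return 1.0
    return 2 * mutual_info / (h1 + h2)
```

For instance, `nmi([0, 0, 1, 1], [1, 1, 0, 0])` evaluates to 1 up to floating-point rounding (the same partition under different labels), and `nmi([0, 0, 1, 1], [0, 1, 0, 1])` evaluates to 0 (independent partitions).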

References

Ah-Pine, J., & Marcotorchino, F. (2007). Statistical, geometrical and logical independences between categorical variables. In Proceedings of the ASMDA2007 Symposium, Chania, Greece.
Albert, R., Jeong, H., & Barabási, A. (1999). Internet: Diameter of the world-wide web. Nature, 401(6749), 130–131.
Barabasi, A. L., & Albert, R. (1999). Emergence of scaling in random networks. Science, 286, 509–512.
Blondel, V., Guillaume, J.-L., Lambiotte, R., & Lefebvre, E. (2008). Fast unfolding of communities in large networks. Journal of Statistical Mechanics: Theory and Experiment, 2008(10), P10008.
Brandes, U., Delling, D., Gaertler, M., Görke, R., Hoefer, M., Nikoloski, Z., et al. (2008). On modularity clustering. IEEE Transactions on Knowledge and Data Engineering, 20(2), 172–188.
Campigotto, R., Conde-Céspedes, P., & Guillaume, J. (2014). A generalized and adaptive method for community detection. CoRR abs/1406.2518.
Conde-Céspedes, P. (2013). Modélisations et extensions du formalisme de l’Analyse Relationnelle Mathématique à la modularisation des grands graphes. Thèse de doctorat: Université Pierre et Marie Curie.
Conde-Céspedes, P., & Marcotorchino, J. (2012). Modularisation et recherche de communautés dans les réseaux complexes par unification relationnelle. In Revue des Nouvelles Technologies de l’Information, Apprentissage Artificiel et Fouille de Données, RNTI-A-6 (pp. 71–97).
Conde-Céspedes, P., & Marcotorchino, F. (2013). Comparison different modularization criteria using relational metric. In F. Nielsen & F. Barbaresco (Eds.), Proceedings First International Conference, Geometric Science of Information (Vol. 1, pp. 180–187). Paris: Springer.
Condorcet, C. A. M. (1785). Essai sur l’application de l’analyse à la probabilité des décisions rendues à la pluralité des voix. Journal of Mathematical Sociology, 1(1), 113–120.
Decaestecker, C. (1992). Apprentissage en classification conceptuelle incrémentale. Ph.D. thesis, Université Libre de Bruxelles (Faculté des Sciences).
Fortunato, S., & Barthelemy, M. (2006). Resolution limit in community detection. In Proceedings of the National Academy of Sciences of the United States of America.
Gleiser, P., & Danon, L. (2003). Community structure in jazz. Advances in Complex Systems (ACS), 06(04), 565–573.
Hoerdt, M., & Magoni, D. (2003). Proceedings of the 11th International Conference on Software, Telecommunications and Computer Networks (vol. 257).
Kumpula, J., Saramäki, J., Kaski, K., & Kertesz, J. (2007). Limited resolution in complex network community detection with potts model approach. The European Physical Journal B, 56(1), 41–45.
Lancichinetti, A., & Fortunato, S. (2009). Community detection algorithms: A comparative analysis. Physical Review E, 80, 056117.
Lancichinetti, A., Fortunato, S., & Radicchi, F. (2008). Benchmark graphs for testing community detection algorithms. Physical Review E, 78, 046110.
Mancoridis, S., Mitchell, B., Rorres, C., Chen, Y., & Gansner, E. (1998). Using automatic clustering to produce high-level system organizations of source code. In The IEEE Proceedings of the 1998 International Workshop on Program Understanding (IWPC 1998) (pp. 45–52). Ischia: IEEE Computer Society.
Marcotorchino, F. (1984). Utilisation des comparaisons par paires en statistique des contingences (partie i). Publication du Centre Scientifique IBM de Paris, F057, et Cahiers du Séminaire Analyse des Données et Processus Stochastiques Université Libre de Bruxelles (pp. 1–57).
Marcotorchino, F. (1985). Utilisation des comparaisons par paires en statistique des contingences (partie iii). Etude F-081 du Centre Scientifique IBM de Paris (pp. 1–39).
Marcotorchino, F. (2013). Optimal transport, spatial interaction models and related problems, impacts on relational metrics, adaptation to large graphs and networks modularity. Internal Publication of Thales.
Marcotorchino, F., & Conde-Céspedes, P. (2013). Optimal transport and minimal trade problem, impacts on relational metrics and applications to large graphs and networks modularity. In F. Nielsen & F. Barbaresco (Eds.), Proceedings of First International Conference, Geometric Science of Information (Vol. 8085, pp. 169–179). Heidelberg: Springer.
Marcotorchino, F., & Michaud, P. (1979). Optimisation en Analyse ordinale des données. Paris: Masson.
Michalski, R., & Stepp, R. (1983). Learning from observation: Conceptual clustering. In R. Michalski, J. Carbonell, T. Mitchell, & M. Kaufmann (Eds.), Machine learning: An artificial intelligence approach, Chap. 11 (Vol. 1, pp. 331–367). Heidelberg: Springer.
Mislove, A., Marcon, M., Gummadi, K., Druschel, P., & Bhattacharjee, B. (2007). Measurement and analysis of online social networks. In Proceedings of the 5th ACM/Usenix Internet Measurement Conference (IMC 2007), San Diego, CA.
Newman, M., & Girvan, M. (2004). Finding and evaluating community structure in networks. Physical Review E, 69, 026113.
Owsiński, J., & Zadrożny, S. (1986). Clustering for ordinal data: A linear programming formulation. Control and Cybernetics, 15(2), 183–193.
Shannon, C. (1948). A mathematical theory of communication. Bell System Technical Journal, 27(379–423), 623–656.
Wei, Y., & Cheng, C. (1989). Towards efficient hierarchical designs by ratio cut partitioning. In IEEE International Conference on Computer-Aided Design (pp. 298–301).
Yang, J., & Leskovec, J. (2012). Defining and evaluating network communities based on ground-truth. In International Conference on Data Mining (pp. 745–754). IEEE Computer Society. abs/1205.6233.
Zahn, C. (1964). Approximating symmetric relations by equivalence relations. SIAM Journal on Applied Mathematics, 12, 840–847.

Acknowledgments

This work is supported by REQUEST and Open Food System projects.

Author information

Authors and Affiliations

L2TI, Institut Galilée, Université Paris 13, 99, av. Jean-Baptiste Clément, 93430, Villetaneuse, France
Patricia Conde-Céspedes & Emmanuel Viennet
Thales Communications et Sécurité, 4 av. des Louvresses, 92230, Gennevilliers, France
Jean-François Marcotorchino

Corresponding author

Correspondence to Patricia Conde-Céspedes.

Editor information

Editors and Affiliations

Polytech Nantes, University of Nantes, Nantes Cedex 3, France
Fabrice Guillet
University of Bordeaux, Talence Cedex, France
Bruno Pinaud
Polytechnics Graduate School, University of Tours, Tours, France
Gilles Venturini

Appendix

Theorem 1

(The density of clusters obtained by maximizing the Zahn-Condorcet criterion is at least 50 %.) Given a connected, undirected and unweighted graph $G=(V,E)$, the optimal partition obtained by optimizing the Zahn-Condorcet criterion has the following property: the number of within-cluster edges of each cluster is at least half of the maximum possible number of within-cluster edges, that is, the number of edges the cluster would have if it were a clique. Furthermore, every node in each cluster is connected to at least half of the other nodes in its cluster.

Proof

Considering the constraints of reflexivity and symmetry of the relational variable $x_{ii'}$ (i.e. $x_{ii}=1 \forall i$ and $x_{ii'}=x_{i'i}$), the expression of Zahn-Condorcet criterion in Table 2 can be written as follows:

$F_{ZC}(X)=\sum _{i>i'}(a_{ii'}-\bar{a}_{ii'})x_{ii'}+N^2-2M-N$.

where:

$\sum _{i>i'}a_{ii'}x_{ii'}$ is the number of within-cluster edges for all clusters.
$\sum _{i>i'}\bar{a}_{ii'}x_{ii'}$ is the number of missing within-cluster edges for all clusters.

If we denote by $E_j$ the number of within-cluster edges of cluster j, which contains $n_j$ nodes, the total number of missing edges in cluster j is $\left( \frac{n_j(n_j-1)}{2}-E_j\right) $. The Zahn-Condorcet criterion then becomes:

$F_{ZC}(\mathscr {C})=\sum _{j=1}^\kappa \left( E_j-\left( \frac{n_j(n_j-1)}{2}-E_j\right) \right) +N^2-2M-N $,

or

$F_{ZC}(\mathscr {C})=\sum _{j=1}^\kappa (2E_j-\frac{n_j(n_j-1)}{2})+N^2-2M-N $.
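
The passage from the pairwise sum to the per-cluster sum can be checked on a small example. The sketch below is illustrative only: it assumes nodes are numbered from 0, that `cluster_of[i]` gives the cluster label of node i, and that $\bar{a}_{ii'}=1-a_{ii'}$ (consistent with the "missing edges" reading above). The function names are hypothetical, and the constant $N^2-2M-N$ is left out because it does not depend on the partition.

```python
from collections import Counter
from itertools import combinations

def pairwise_form(n_nodes, edges, cluster_of):
    """Sum over pairs i > i' of (a_ii' - abar_ii') * x_ii', assuming abar = 1 - a."""
    edge_set = {frozenset(e) for e in edges}
    total = 0
    for i, k in combinations(range(n_nodes), 2):
        if cluster_of[i] != cluster_of[k]:
            continue  # x_ii' = 0: the pair does not contribute
        a = 1 if frozenset((i, k)) in edge_set else 0
        total += a - (1 - a)
    return total

def cluster_form(edges, cluster_of):
    """Sum over clusters j of 2*E_j - n_j*(n_j - 1)/2."""
    sizes = Counter(cluster_of)  # n_j for each cluster label
    within = Counter(            # E_j for each cluster label
        cluster_of[u] for u, v in edges if cluster_of[u] == cluster_of[v]
    )
    return sum(2 * within[c] - n * (n - 1) // 2 for c, n in sizes.items())

# Toy graph: two triangles joined by the single edge (2, 3).
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
two_triangles = [0, 0, 0, 1, 1, 1]
one_block = [0, 0, 0, 0, 0, 0]

for partition in (two_triangles, one_block):
    print(pairwise_form(6, edges, partition), cluster_form(edges, partition))
```

On this toy graph both forms give 6 for the two-triangle partition and -1 for the single-block partition, so the partition into two triangles scores higher.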

The term $(2E_j-\frac{n_j(n_j-1)}{2})$ represents the contribution of cluster j to the value of the criterion. For each cluster of the optimal partition this term must be non-negative. Otherwise, a better partition could be obtained by splitting cluster j into single-node clusters, since a single-node cluster contributes zero to the value of the criterion. This implies:

$(2E_j-\frac{n_j(n_j-1)}{2})\ge 0$, or $E_j\ge \frac{n_j(n_j-1)}{4}$.

So, each cluster j has a density of at least 50 %.
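
A few hand-picked clusters illustrate the link between the sign of the contribution and the 50 % density threshold. This is a hedged sketch: the helper names `contribution` and `density` and the three example clusters are chosen for illustration and do not come from the chapter.

```python
def contribution(n_j, e_j):
    """Cluster contribution 2*E_j - n_j*(n_j - 1)/2."""
    return 2 * e_j - n_j * (n_j - 1) // 2

def density(n_j, e_j):
    """Fraction of the possible within-cluster edges that are present (n_j >= 2)."""
    return e_j / (n_j * (n_j - 1) // 2)

# (number of nodes, number of within-cluster edges)
examples = {"4-clique": (4, 6), "4-path": (4, 3), "5-star": (5, 4)}
for name, (n_j, e_j) in examples.items():
    print(name, contribution(n_j, e_j), density(n_j, e_j))
```

The 4-clique has contribution 6 at density 1.0, the 4-node path sits exactly on the boundary with contribution 0 at density 0.5, and the 5-node star has contribution -2 at density 0.4, so splitting the star into single-node clusters (total contribution 0) would raise the criterion.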

This result extends to every node of each cluster of the optimal partition. Suppose that some cluster $C_j$ contains a node $n_0$ that is connected to fewer than half of the other nodes in the cluster, and denote by $E_{j_0}$ the number of connections of $n_0$ to nodes in $C_j$. Then $E_{j_0}<\frac{(n_j-1)}{2}$.

In that case, isolating $n_0$ yields a better partition. Indeed, the combined contribution of the two clusters obtained after isolating node $n_0$ is:

$2(E_j-E_{j_0})-\frac{(n_j-1)(n_j-2)}{2}$

This expression exceeds the contribution of cluster j, given by $(2E_j-\frac{n_j(n_j-1)}{2})$, by $(n_j-1)-2E_{j_0}$, which is strictly positive when $n_0$ is connected to fewer than half of the other nodes in $C_j$. This contradicts the optimality of the partition.

This also explains why partitions obtained by optimizing the Zahn-Condorcet criterion sometimes contain single-node clusters. $\square $
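
The node-level argument of the proof can be explored numerically. The sketch below, with hypothetical helper names, computes the change in the criterion when a node with `e_j0` neighbours inside its cluster is moved to a cluster of its own. It relies only on the per-cluster contribution $2E_j-\frac{n_j(n_j-1)}{2}$ used in the proof.

```python
def contribution(n_j, e_j):
    """Cluster contribution 2*E_j - n_j*(n_j - 1)/2."""
    return 2 * e_j - n_j * (n_j - 1) // 2

def gain_from_isolating(n_j, e_j, e_j0):
    """Change in the criterion when a node with e_j0 in-cluster neighbours is split off."""
    before = contribution(n_j, e_j)
    # The remaining cluster loses one node and e_j0 edges;
    # the new single-node cluster contributes contribution(1, 0) = 0.
    after = contribution(n_j - 1, e_j - e_j0) + contribution(1, 0)
    return after - before

# Cluster of 5 nodes and 6 within-cluster edges.
print(gain_from_isolating(5, 6, 1))  # node with 1 of 4 possible neighbours
print(gain_from_isolating(5, 6, 3))  # node with 3 of 4 possible neighbours
```

The first call returns 2 and the second returns -2: removing the weakly attached node raises the criterion, while removing the well-connected node lowers it. In general the gain reduces to `(n_j - 1) - 2 * e_j0`, independently of `e_j`.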

About this chapter

Cite this chapter

Conde-Céspedes, P., Marcotorchino, JF., Viennet, E. (2017). Comparison of Linear Modularization Criteria Using the Relational Formalism, an Approach to Easily Identify Resolution Limit. In: Guillet, F., Pinaud, B., Venturini, G. (eds) Advances in Knowledge Discovery and Management. Studies in Computational Intelligence, vol 665. Springer, Cham. https://doi.org/10.1007/978-3-319-45763-5_6

DOI: https://doi.org/10.1007/978-3-319-45763-5_6
Published: 04 November 2016
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-45762-8
Online ISBN: 978-3-319-45763-5
eBook Packages: Engineering, Engineering (R0)
