Abstract
The modularization of large graphs, or community detection in networks, is usually approached as the optimization of a quality function or criterion, for instance the Newman-Girvan modularity. Other clustering criteria exist, each with its own properties, and they lead to different solutions. In this paper we present six linear modularization criteria in relational notation: the Newman-Girvan modularity, Zahn-Condorcet, Owsiński-Zadrożny, the Deviation to Uniformity index, the Deviation to Indetermination index and the Balanced Modularity. We use a generic version of the Louvain algorithm to approximate the optimal partition for each criterion on real networks of different sizes. We found that these partitions differ substantially in their number of clusters. The relational formalism allows us to justify these differences from a theoretical point of view. Moreover, this notation makes it easy to identify the criteria that have a resolution limit (a phenomenon that causes a criterion to fail to identify modules smaller than a given scale). This finding is confirmed on artificial LFR benchmark graphs.
Notes
- 1.
- 2.
- 3.
Although the name of this criterion contains the word balanced, its definition is not related to the property of balance given in Definition 1.
- 4.
The data was taken from http://snap.stanford.edu/data/com-Amazon.html.
- 5.
The contribution for the Balanced Modularity will be given later.
- 6.
This result is a consequence of the rule this criterion relies on: Condorcet's rule of absolute majority, from voting theory.
- 7.
These expressions are deduced from the following two expressions of the Balanced Modularity in terms of the Newman-Girvan and Deviation to Indetermination criteria:
$$F_{BM}=2F_{NG}+\sum _{i=1}^N \sum _{i'=1}^N \left( \frac{(a_{i.}-d_{av})(a_{.i'}-d_{av})}{2M(1-\delta )}\right) x_{ii'}$$
and
$$F_{BM}=2F_{DI}+\left( 2-\frac{1}{\delta }\right) \sum _{i=1}^N \sum _{i'=1}^N \left( \frac{(a_{i.}-d_{av})(a_{.i'}-d_{av})}{N^2(1-\delta )} \right) x_{ii'}.$$
- 8.
LFR graphs are benchmark graphs introduced in Lancichinetti et al. (2008) that aim to reproduce, as closely as possible, the properties of nodes and communities observed in real networks. These artificial graphs have a predefined community structure controlled by the mixing parameter of each node. As stated in Lancichinetti et al. (2008), the mixing parameter of a node is the fraction of its links shared with nodes outside its community.
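The definition of the mixing parameter can be illustrated with a minimal sketch. The function name `mixing_parameters`, the edge-list input format and the toy graph are assumptions made for this illustration; they are not taken from the LFR generator itself.

```python
from collections import defaultdict


def mixing_parameters(edges, community):
    """Return, for each node, the fraction of its links that leave its community.

    `edges` is a list of undirected pairs (u, v); `community` maps node -> label.
    Both input formats are assumptions of this sketch.
    """
    degree = defaultdict(int)
    external = defaultdict(int)
    for u, v in edges:
        # Count the edge once from each endpoint's point of view.
        for a, b in ((u, v), (v, u)):
            degree[a] += 1
            if community[a] != community[b]:
                external[a] += 1
    return {node: external[node] / degree[node] for node in degree}


# Hypothetical toy graph: two triangles joined by the single edge (2, 3).
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
community = {0: "A", 1: "A", 2: "A", 3: "B", 4: "B", 5: "B"}
mu = mixing_parameters(edges, community)
```

In this toy graph only nodes 2 and 3 have a link leaving their community, so they are the only nodes with a non-zero mixing parameter (one link out of three).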
- 9.
The normalized mutual information (NMI) measures the similarity of two partitions. It originates in information theory, where it measures the departure from independence between two random variables. Given a set of objects V and two partitions \(P_1\) and \(P_2\) defined on V, the mutual information intuitively measures the information that \(P_1\) and \(P_2\) share. It is normalized between 0 and 1: it equals 0 if the two partitions are independent and 1 if they are identical. Let p and q be the numbers of clusters of partitions \(P_1\) and \(P_2\) respectively. The NMI is calculated as follows:
$$\begin{aligned} NMI(P_1,P_2)=\frac{2I(P_1,P_2)}{H(P_1)+H(P_2)} \end{aligned}$$
where:
-
\(I(P_1,P_2)=\sum _{u=1}^p \sum _{v=1}^q p_{uv}\ln \left( \frac{p_{uv}}{p_{u.} p_{.v}} \right) \) is the mutual information of partitions \(P_1\) and \(P_2\). I tells how much we learn about \(P_1\) if we know \(P_2\) and vice versa. The quantity \(p_{uv}=\frac{n_{uv}}{N}\) is the fraction of objects that belong simultaneously to cluster u of partition \(P_1\) and to cluster v of partition \(P_2\). Analogously, \(p_{u.}=\frac{n_{u.}}{N}\) is the fraction of objects that belong to cluster u of partition \(P_1\), \(p_{.v}=\frac{n_{.v}}{N}\) is the fraction of objects that belong to cluster v of partition \(P_2\), and \(|V|=N\). In the case \(n_{uv}=0\) we assume \(\ln \left( \frac{p_{uv}}{p_{u.} p_{.v}} \right) =0\).
-
\(H(P_1)=-\sum _{u=1}^p p_{u.}\ln p_{u.}\) is the Shannon entropy of \(P_1\) and \(H(P_2)=-\sum _{v=1}^q p_{.v}\ln p_{.v}\) is the Shannon entropy of \(P_2\) (see Shannon (1948)).
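The NMI formula above can be sketched directly from its definition. The function name `nmi` and the representation of a partition as a list of cluster labels (one per object) are assumptions of this sketch. The value returned when both partitions consist of a single cluster, where the denominator vanishes, is also a convention chosen here and is not specified in the text.

```python
import math
from collections import Counter


def nmi(labels1, labels2):
    """Normalized mutual information 2*I / (H1 + H2) of two partitions.

    Each partition is given as a sequence of cluster labels, one per object.
    """
    n = len(labels1)
    if n == 0 or n != len(labels2):
        raise ValueError("partitions must be non-empty and of equal length")

    joint = Counter(zip(labels1, labels2))  # n_uv (only non-empty cells)
    row = Counter(labels1)                  # n_u.
    col = Counter(labels2)                  # n_.v

    # Cells with n_uv = 0 never appear in `joint`, which matches the
    # convention that their contribution to I is zero.
    mutual = sum(
        (n_uv / n) * math.log(n_uv * n / (row[u] * col[v]))
        for (u, v), n_uv in joint.items()
    )
    h1 = -sum((c / n) * math.log(c / n) for c in row.values())
    h2 = -sum((c / n) * math.log(c / n) for c in col.values())

    if h1 + h2 == 0:
        # Assumption of this sketch: two single-cluster partitions are identical.
        return 1.0
    return 2 * mutual / (h1 + h2)


same = nmi([0, 0, 1, 1], ["b", "b", "a", "a"])   # identical up to relabeling
indep = nmi([0, 0, 1, 1], [0, 1, 0, 1])          # independent partitions
```

The two calls illustrate the extreme values: partitions that coincide up to a renaming of clusters give an NMI of 1, and independent partitions give 0.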
- 10.
By small we mean communities ranging from 10 to 50 nodes, the same sizes considered by the authors of LFR graphs (see Lancichinetti and Fortunato (2009)).
References
Ah-Pine, J., & Marcotorchino, F. (2007). Statistical, geometrical and logical independences between categorical variables. In Proceedings of the ASMDA2007 Symposium, Chania, Greece.
Albert, R., Jeong, H., & Barabási, A. (1999). Internet: Diameter of the world-wide web. Nature, 401(6749), 130–131.
Barabási, A.-L., & Albert, R. (1999). Emergence of scaling in random networks. Science, 286, 509–512.
Blondel, V., Guillaume, J.-L., Lambiotte, R., & Lefebvre, E. (2008). Fast unfolding of communities in large networks. Journal of Statistical Mechanics: Theory and Experiment, 2008(10), P10008.
Brandes, U., Delling, D., Gaertler, M., Görke, R., Hoefer, M., Nikoloski, Z., et al. (2008). On modularity clustering. IEEE Transactions on Knowledge and Data Engineering, 20(2), 172–188.
Campigotto, R., Conde-Céspedes, P., & Guillaume, J. (2014). A generalized and adaptive method for community detection. CoRR abs/1406.2518.
Conde-Céspedes, P. (2013). Modélisations et extensions du formalisme de l’Analyse Relationnelle Mathématique à la modularisation des grands graphes. Thèse de doctorat: Université Pierre et Marie Curie.
Conde-Céspedes, P., & Marcotorchino, J. (2012). Modularisation et recherche de communautés dans les réseaux complexes par unification relationnelle. In Revue des Nouvelles Technologies de l’Information, Apprentissage Artificiel et Fouille de Données, RNTI-A-6 (pp. 71–97).
Conde-Céspedes, P., & Marcotorchino, F. (2013). Comparison of different modularization criteria using relational metric. In F. Nielsen & F. Barbaresco (Eds.), Proceedings of First International Conference, Geometric Science of Information (Vol. 1, pp. 180–187). Paris: Springer.
Condorcet, C. A. M. (1785). Essai sur l’application de l’analyse à la probabilité des décisions rendues à la pluralité des voix. Journal of Mathematical Sociology, 1(1), 113–120.
Decaestecker, C. (1992). Apprentissage en classification conceptuelle incrémentale. Ph.D. thesis, Université Libre de Bruxelles (Faculté des Sciences).
Fortunato, S., & Barthelemy, M. (2006). Resolution limit in community detection. In Proceedings of the National Academy of Sciences of the United States of America.
Gleiser, P., & Danon, L. (2003). Community structure in jazz. Advances in Complex Systems (ACS), 06(04), 565–573.
Hoerdt, M., & Magoni, D. (2003). Proceedings of the 11th International Conference on Software, Telecommunications and Computer Networks (vol. 257).
Kumpula, J., Saramäki, J., Kaski, K., & Kertesz, J. (2007). Limited resolution in complex network community detection with Potts model approach. The European Physical Journal B, 56(1), 41–45.
Lancichinetti, A., & Fortunato, S. (2009). Community detection algorithms: A comparative analysis. Physical Review E, 80, 056117.
Lancichinetti, A., Fortunato, S., & Radicchi, F. (2008). Benchmark graphs for testing community detection algorithms. Physical Review E, 78, 046110.
Mancoridis, S., Mitchell, B., Rorres, C., Chen, Y., & Gansner, E. (1998). Using automatic clustering to produce high-level system organizations of source code. In The IEEE Proceedings of the 1998 International Workshop on Program Understanding (IWPC 1998) (pp. 45–52). Ischia: IEEE Computer Society.
Marcotorchino, F. (1984). Utilisation des comparaisons par paires en statistique des contingences (partie i). Publication du Centre Scientifique IBM de Paris, F057, et Cahiers du Séminaire Analyse des Données et Processus Stochastiques Université Libre de Bruxelles (pp. 1–57).
Marcotorchino, F. (1985). Utilisation des comparaisons par paires en statistique des contingences (partie iii). Etude F-081 du Centre Scientifique IBM de Paris (pp. 1–39).
Marcotorchino, F. (2013). Optimal transport, spatial interaction models and related problems, impacts on relational metrics, adaptation to large graphs and networks modularity. Internal Publication of Thales.
Marcotorchino, F., & Conde-Céspedes, P. (2013). Optimal transport and minimal trade problem, impacts on relational metrics and applications to large graphs and networks modularity. In F. Nielsen & F. Barbaresco (Eds.), Proceedings of First International Conference, Geometric Science of Information (Vol. 8085, pp. 169–179). Heidelberg: Springer.
Marcotorchino, F., & Michaud, P. (1979). Optimisation en Analyse ordinale des données. Paris: Masson.
Michalski, R., & Stepp, R. (1983). Learning from observation: Conceptual clustering. In R. Michalski, J. Carbonell, T. Mitchell, & M. Kaufmann (Eds.), Machine learning: An artificial intelligence approach, Chap. 11 (Vol. 1, pp. 331–367). Heidelberg: Springer.
Mislove, A., Marcon, M., Gummadi, K., Druschel, P., & Bhattacharjee, B. (2007). Measurement and analysis of online social networks. In Proceedings of the 5th ACM/Usenix Internet Measurement Conference (IMC 2007), San Diego, CA.
Newman, M., & Girvan, M. (2004). Finding and evaluating community structure in networks. Physical Review E, 69, 026113.
Owsiński, J., & Zadrożny, S. (1986). Clustering for ordinal data: A linear programming formulation. Control and Cybernetics, 15(2), 183–193.
Shannon, C. (1948). A mathematical theory of communication. Bell System Technical Journal, 27(3), 379–423; 27(4), 623–656.
Wei, Y., & Cheng, C. (1989). Towards efficient hierarchical designs by ratio cut partitioning. In IEEE International Conference on Computer-Aided Design (pp. 298–301).
Yang, J., & Leskovec, J. (2012). Defining and evaluating network communities based on ground-truth. In International Conference on Data Mining (pp. 745–754). IEEE Computer Society. abs/1205.6233.
Zahn, C. (1964). Approximating symmetric relations by equivalence relations. SIAM Journal on Applied Mathematics, 12, 840–847.
Acknowledgments
This work is supported by REQUEST and Open Food System projects.
Appendix
Theorem 1
(The density of the clusters obtained by maximizing the Zahn-Condorcet criterion is at least 50 %). Given a connected, undirected and unweighted graph \(G=(V,E)\), the optimal partition obtained by optimizing the Zahn-Condorcet criterion has the following property: the number of within-cluster edges of each cluster is at least half of the maximum possible number of within-cluster edges, that is, the number of edges the cluster would have if it were a clique. Furthermore, every node in each cluster is connected to at least half of the other nodes in its cluster.
Proof
Considering the constraints of reflexivity and symmetry of the relational variable \(x_{ii'}\) (i.e. \(x_{ii}=1\ \forall i\) and \(x_{ii'}=x_{i'i}\)), the expression of the Zahn-Condorcet criterion in Table 2 can be written as follows:
\(F_{ZC}(X)=\sum _{i>i'}(a_{ii'}-\bar{a}_{ii'})x_{ii'}+N^2-2M-N\),
where:
-
\(\sum _{i>i'}a_{ii'}x_{ii'}\) is the number of within-cluster edges for all clusters.
-
\(\sum _{i>i'}\bar{a}_{ii'}x_{ii'}\) is the number of missing within-cluster edges for all clusters.
If we denote by \(E_j\) the number of within-cluster edges of cluster j and by \(n_j\) its number of nodes, the total number of missing edges for cluster j is \(\left( \frac{n_j(n_j-1)}{2}-E_j\right) \). So, the Zahn-Condorcet criterion becomes:
\(F_{ZC}(\mathscr {C})=\sum _{j=1}^\kappa \left( E_j-\left( \frac{n_j(n_j-1)}{2}-E_j\right) \right) +N^2-2M-N \),
or
\(F_{ZC}(\mathscr {C})=\sum _{j=1}^\kappa \left( 2E_j-\frac{n_j(n_j-1)}{2}\right) +N^2-2M-N \).
The term \(\left( 2E_j-\frac{n_j(n_j-1)}{2}\right) \) represents the contribution of cluster j to the value of the criterion. For each cluster of the optimal partition this term must be positive or null. Otherwise it would be possible to obtain a better partition by isolating each node of cluster j (the contribution of a cluster made of a single isolated node is null). This implies:
\(2E_j-\frac{n_j(n_j-1)}{2}\ge 0\), or \(E_j\ge \frac{n_j(n_j-1)}{4}\).
So, each cluster j has a density of at least 50 %.
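As a numerical illustration of this bound, the following minimal sketch computes the contribution \(2E_j-\frac{n_j(n_j-1)}{2}\) and the density of each cluster of a given partition. The function name `cluster_contributions`, the input formats and the toy graph are assumptions of this sketch, and the density of a single-node cluster is set to 1 by convention.

```python
from collections import Counter


def cluster_contributions(edges, cluster_of):
    """Map each cluster j to (2*E_j - n_j*(n_j-1)/2, density of j).

    `edges` is a list of undirected pairs; `cluster_of` maps node -> cluster label.
    """
    sizes = Counter(cluster_of.values())        # n_j
    within = Counter()                          # E_j
    for u, v in edges:
        if cluster_of[u] == cluster_of[v]:
            within[cluster_of[u]] += 1

    report = {}
    for j, n_j in sizes.items():
        max_edges = n_j * (n_j - 1) // 2        # edges of a clique on n_j nodes
        contribution = 2 * within[j] - max_edges
        # Convention of this sketch: a singleton cluster has density 1.
        density = within[j] / max_edges if max_edges else 1.0
        report[j] = (contribution, density)
    return report


# Hypothetical toy graph: a path 0-1-2-3-4 (cluster "P") linked to a
# triangle 5-6-7 (cluster "T") by the edge (4, 5).
edges = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 6), (5, 7), (6, 7)]
cluster_of = {0: "P", 1: "P", 2: "P", 3: "P", 4: "P", 5: "T", 6: "T", 7: "T"}
report = cluster_contributions(edges, cluster_of)
```

Here the path cluster has 4 of 10 possible edges, so its contribution is negative and it could not appear in an optimal Zahn-Condorcet partition, whereas the triangle has a positive contribution.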
This result can be extended to every node of each cluster of the optimal partition. Suppose that there is a cluster j containing a node \(n_0\) that is connected to fewer than half of the other nodes in the cluster. Let \(E_{j_0}\) denote the number of connections of \(n_0\) to nodes in \(C_j\), so that \(E_{j_0}<\frac{n_j-1}{2}\).
It is then always possible to obtain a better partition by isolating \(n_0\). Indeed, the joint contribution of the two clusters resulting from the isolation of node \(n_0\) is:
\(2(E_j-E_{j_0})-\frac{(n_j-1)(n_j-2)}{2}\).
This expression exceeds the contribution of cluster j, given by \(2E_j-\frac{n_j(n_j-1)}{2}\), by \((n_j-1)-2E_{j_0}\), which is strictly positive exactly when \(E_{j_0}<\frac{n_j-1}{2}\), contradicting the optimality of the partition.
This also explains why the partitions obtained by optimizing the Zahn-Condorcet criterion sometimes contain clusters made of a single isolated node. \(\square \)
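The node-level argument of the proof can be checked numerically with a short sketch. The helper names `contribution` and `isolation_gain` and the numerical values are hypothetical, chosen only to illustrate the comparison between the contribution of cluster j before and after isolating one node.

```python
def contribution(e, n):
    """Contribution 2*E - n*(n-1)/2 of a cluster with n nodes and E inner edges."""
    return 2 * e - n * (n - 1) // 2


def isolation_gain(e_j, n_j, e_j0):
    """Change in the criterion when a node with e_j0 in-cluster links is isolated.

    The cluster loses one node and e_j0 edges; the isolated node forms a
    singleton cluster whose contribution is zero.
    """
    after = contribution(e_j - e_j0, n_j - 1) + contribution(0, 1)
    return after - contribution(e_j, n_j)


# Hypothetical cluster with n_j = 5 nodes and E_j = 7 inner edges.
gains = {k: isolation_gain(7, 5, k) for k in (1, 2, 3)}
```

With five nodes, half of the other nodes is two: isolating a node with one inner link improves the criterion, a node with exactly two links leaves it unchanged, and a node with three links makes it worse. The gain does not depend on \(E_j\), only on \(n_j\) and the number of inner links of the isolated node.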
© 2017 Springer International Publishing Switzerland
About this chapter
Cite this chapter
Conde-Céspedes, P., Marcotorchino, JF., Viennet, E. (2017). Comparison of Linear Modularization Criteria Using the Relational Formalism, an Approach to Easily Identify Resolution Limit. In: Guillet, F., Pinaud, B., Venturini, G. (eds) Advances in Knowledge Discovery and Management. Studies in Computational Intelligence, vol 665. Springer, Cham. https://doi.org/10.1007/978-3-319-45763-5_6
DOI: https://doi.org/10.1007/978-3-319-45763-5_6
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-45762-8
Online ISBN: 978-3-319-45763-5