Comparison of Linear Modularization Criteria Using the Relational Formalism, an Approach to Easily Identify Resolution Limit

Advances in Knowledge Discovery and Management

Abstract

The modularization of large graphs, or community detection in networks, is usually approached as the optimization of a quality function or criterion, for instance the Newman-Girvan modularity. Other clustering criteria exist, each with its own properties, and they lead to different solutions. In this paper we present six linear modularization criteria in relational notation: the Newman-Girvan modularity, Zahn-Condorcet, Owsiński-Zadrożny, the Deviation to Uniformity index, the Deviation to Indetermination index and the Balanced Modularity. We use a generic version of the Louvain algorithm to approximate the optimal partition for each criterion on real networks of different sizes. We found that the resulting partitions differ substantially in their number of clusters. The relational formalism allows us to justify these differences from a theoretical point of view. Moreover, this notation makes it easy to identify the criteria that have a resolution limit (a phenomenon that causes a criterion to fail to identify modules smaller than a given scale). This finding is confirmed on artificial LFR benchmark graphs.

Notes

  1.

    For more details on Relational Analysis theory, see Marcotorchino and Michaud (1979) and Marcotorchino (1984).

  2.

    There exists a duality between the independence structure and the indetermination structure (Marcotorchino 1984, 1985; Ah-Pine and Marcotorchino 2007).

  3.

    Although the name of this criterion contains the word balanced, its definition is not related to the property of balance given in Definition 1.

  4.

    The data was taken from http://snap.stanford.edu/data/com-Amazon.html.

  5.

    The contribution for the Balanced Modularity will be given later.

  6.

    This result is a consequence of the rule this criterion relies on: “The rule of absolute majority of Condorcet” in voting theory.

  7.

    These expressions are deduced from the following two expressions of the Balanced Modularity in terms of the Newman-Girvan and Deviation to Indetermination criteria:

    $$F_{BM}=2F_{NG}+\sum _{i=1}^N \sum _{i'=1}^N \left( \frac{(a_{i.}-d_{av})(a_{.i'}-d_{av})}{2M(1-\delta )}\right) x_{ii'}$$

    and

    $$F_{BM}=2F_{DI}+\left( 2-\frac{1}{\delta }\right) \sum _{i=1}^N \sum _{i'=1}^N \left( \frac{(a_{i.}-d_{av})(a_{.i'}-d_{av})}{N^2(1-\delta )} \right) x_{ii'}.$$
  8.

    LFR graphs are benchmark graphs introduced in Lancichinetti et al. (2008) that aim to reproduce, as closely as possible, the properties of nodes and communities observed in real networks. These artificial graphs have a predefined community structure based on the mixing parameter of each node. As stated in Lancichinetti et al. (2008), the mixing parameter of a node is the fraction of its links that connect it to nodes outside its community.

  9.

    The normalized mutual information (NMI) is a measure of the similarity of two partitions. It originated in information theory, where it measures the departure from independence between two random variables. Given a set of objects V and two partitions \(P_1\) and \(P_2\) defined on V, the mutual information intuitively measures the information that \(P_1\) and \(P_2\) share. It is normalized between 0 and 1: it equals 0 if the two partitions are independent and 1 if they are identical. Let p and q be the total number of clusters of partitions \(P_1\) and \(P_2\) respectively. The NMI is calculated as follows:

    $$\begin{aligned} NMI(P_1,P_2)=\frac{2I(P_1,P_2)}{H(P_1)+H(P_2)} \end{aligned}$$

    where:

    • \(I(P_1,P_2)=\sum _{u=1}^p \sum _{v=1}^q p_{uv}\ln \left( \frac{p_{uv}}{p_{u.} p_{.v}} \right) \) is the mutual information of partitions \(P_1\) and \(P_2\). I tells how much we learn about \(P_1\) if we know \(P_2\) and vice versa. The quantity \(p_{uv}=\frac{n_{uv}}{N}\) is the fraction of objects that belong simultaneously to cluster u of partition \(P_1\) and to cluster v of partition \(P_2\). Analogously, \(p_{u.}=\frac{n_{u.}}{N}\) is the fraction of objects that belong to cluster u of partition \(P_1\), \(p_{.v}=\frac{n_{.v}}{N}\) is the fraction of objects that belong to cluster v of partition \(P_2\), and \(N=|V|\). In the case \(n_{uv}=0\) we assume \(p_{uv}\ln \left( \frac{p_{uv}}{p_{u.} p_{.v}} \right) =0\).

    • \(H(P_1)=-\sum _{u=1}^p p_{u.}\ln p_{u.}\) represents the Shannon entropy of \(P_1\) and \(H(P_2)=-\sum _{v=1}^q p_{.v}\ln p_{.v}\) represents the Shannon entropy of \(P_2\) (see Shannon 1948).

  10.

    We call small those communities that contain between 10 and 50 nodes, the same size range considered by the authors of the LFR graphs (see Lancichinetti and Fortunato 2009).
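As an illustration of the NMI defined in note 9, the following is a minimal Python sketch, not the authors' implementation. It assumes each partition is given as a list of cluster labels, one per object; the function name `nmi` and the convention of returning 1.0 when both partitions consist of a single cluster (where the formula is 0/0) are assumptions made for this example.

```python
import math
from collections import Counter


def nmi(labels1, labels2):
    """Normalized mutual information 2*I/(H1+H2) of two partitions.

    Each partition is a list of cluster labels, one label per object.
    """
    if len(labels1) != len(labels2):
        raise ValueError("both partitions must label the same objects")
    n = len(labels1)
    joint = Counter(zip(labels1, labels2))  # n_uv
    rows = Counter(labels1)                 # n_u.
    cols = Counter(labels2)                 # n_.v

    # Pairs (u, v) with n_uv = 0 never appear in `joint`,
    # which matches the convention that such terms contribute 0.
    mutual_info = sum(
        (c / n) * math.log((c / n) / ((rows[u] / n) * (cols[v] / n)))
        for (u, v), c in joint.items()
    )
    h1 = -sum((c / n) * math.log(c / n) for c in rows.values())
    h2 = -sum((c / n) * math.log(c / n) for c in cols.values())

    if h1 + h2 == 0:
        # Assumption: both partitions have a single cluster, so the
        # formula is 0/0; we treat identical partitions as NMI = 1.
        return 1.0
    return 2 * mutual_info / (h1 + h2)


# Same partition up to a renaming of the clusters.
print(nmi([0, 0, 1, 1], ["b", "b", "a", "a"]))
# Two independent partitions of four objects.
print(nmi([0, 0, 1, 1], [0, 1, 0, 1]))
```

The first call compares a partition with a relabelled copy of itself and the second compares two independent partitions, so the values should be 1 and 0 respectively, up to floating-point rounding.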

References

  • Ah-Pine, J., & Marcotorchino, F. (2007). Statistical, geometrical and logical independences between categorical variables. In Proceedings of the ASMDA2007 Symposium, Chania, Greece.

  • Albert, R., Jeong, H., & Barabási, A. (1999). Internet: Diameter of the world-wide web. Nature, 401(6749), 130–131.

  • Barabasi, A. L., & Albert, R. (1999). Emergence of scaling in random networks. Science, 286, 509–512.

  • Blondel, V., Guillaume, J.-L., Lambiotte, R., & Lefebvre, E. (2008). Fast unfolding of communities in large networks. Journal of Statistical Mechanics: Theory and Experiment, 2008(10), P10008.

  • Brandes, U., Delling, D., Gaertler, M., Görke, R., Hoefer, M., Nikoloski, Z., et al. (2008). On modularity clustering. IEEE Transactions on Knowledge and Data Engineering, 20(2), 172–188.

  • Campigotto, R., Conde-Céspedes, P., & Guillaume, J. (2014). A generalized and adaptive method for community detection. CoRR abs/1406.2518.

  • Conde-Céspedes, P. (2013). Modélisations et extensions du formalisme de l’Analyse Relationnelle Mathématique à la modularisation des grands graphes. Thèse de doctorat: Université Pierre et Marie Curie.

  • Conde-Céspedes, P., & Marcotorchino, J. (2012). Modularisation et recherche de communautés dans les réseaux complexes par unification relationnelle. In Revue des Nouvelles Technologies de l’Information, Apprentissage Artificiel et Fouille de Données, RNTI-A-6 (pp. 71–97).

  • Conde-Céspedes, P., & Marcotorchino, F. (2013). Comparison of different modularization criteria using relational metric. In F. Nielsen & F. Barbaresco (Eds.), Proceedings First International Conference, Geometric Science of Information (Vol. 1, pp. 180–187). Paris: Springer.

  • Condorcet, C. A. M. (1785). Essai sur l’application de l’analyse à la probabilité des décisions rendues à la pluralité des voix. Journal of Mathematical Sociology, 1(1), 113–120.

  • Decaestecker, C. (1992). Apprentissage en classification conceptuelle incrémentale. Ph.D. thesis, Université Libre de Bruxelles (Faculté des Sciences).

  • Fortunato, S., & Barthelemy, M. (2006). Resolution limit in community detection. In Proceedings of the National Academy of Sciences of the United States of America.

  • Gleiser, P., & Danon, L. (2003). Community structure in jazz. Advances in Complex Systems (ACS), 06(04), 565–573.

  • Hoerdt, M., & Magoni, D. (2003). Proceedings of the 11th International Conference on Software, Telecommunications and Computer Networks (vol. 257).

  • Kumpula, J., Saramäki, J., Kaski, K., & Kertesz, J. (2007). Limited resolution in complex network community detection with potts model approach. The European Physical Journal B, 56(1), 41–45.

  • Lancichinetti, A., & Fortunato, S. (2009). Community detection algorithms: A comparative analysis. Physical Review E, 80, 056117.

  • Lancichinetti, A., Fortunato, S., & Radicchi, F. (2008). Benchmark graphs for testing community detection algorithms. Physical Review E, 78, 046110.

  • Mancoridis, S., Mitchell, B., Rorres, C., Chen, Y., & Gansner, E. (1998). Using automatic clustering to produce high-level system organizations of source code. In The IEEE Proceedings of the 1998 International Workshop on Program Understanding (IWPC 1998) (pp. 45–52). Ischia: IEEE Computer Society.

  • Marcotorchino, F. (1984). Utilisation des comparaisons par paires en statistique des contingences (partie i). Publication du Centre Scientifique IBM de Paris, F057, et Cahiers du Séminaire Analyse des Données et Processus Stochastiques Université Libre de Bruxelles (pp. 1–57).

  • Marcotorchino, F. (1985). Utilisation des comparaisons par paires en statistique des contingences (partie iii). Etude F-081 du Centre Scientifique IBM de Paris (pp. 1–39).

  • Marcotorchino, F. (2013). Optimal transport, spatial interaction models and related problems, impacts on relational metrics, adaptation to large graphs and networks modularity. Internal Publication of Thales.

  • Marcotorchino, F., & Conde-Céspedes, P. (2013). Optimal transport and minimal trade problem, impacts on relational metrics and applications to large graphs and networks modularity. In F. Nielsen & F. Barbaresco (Eds.), Proceedings of First International Conference, Geometric Science of Information (Vol. 8085, pp. 169–179). Heidelberg: Springer.

  • Marcotorchino, F., & Michaud, P. (1979). Optimisation en Analyse ordinale des données. Paris: Masson.

  • Michalski, R., & Stepp, R. (1983). Learning from observation: Conceptual clustering. In R. Michalski, J. Carbonell, T. Mitchell, & M. Kaufmann (Eds.), Machine learning: An artificial intelligence approach, Chap. 11 (Vol. 1, pp. 331–367). Heidelberg: Springer.

  • Mislove, A., Marcon, M., Gummadi, K., Druschel, P., & Bhattacharjee, B. (2007). Measurement and analysis of online social networks. In Proceedings of the 5th ACM/Usenix Internet Measurement Conference (IMC 2007), San Diego, CA.

  • Newman, M., & Girvan, M. (2004). Finding and evaluating community structure in networks. Physical Review E, 69, 026113.

  • Owsiński, J., & Zadrożny, S. (1986). Clustering for ordinal data: A linear programming formulation. Control and Cybernetics, 15(2), 183–193.

  • Shannon, C. (1948). A mathematical theory of communication. Bell System Technical Journal, 27(379–423), 623–656.

  • Wei, Y., & Cheng, C. (1989). Towards efficient hierarchical designs by ratio cut partitioning. In IEEE International Conference on Computer-Aided Design (pp. 298–301).

  • Yang, J., & Leskovec, J. (2012). Defining and evaluating network communities based on ground-truth. In International Conference on Data Mining (pp. 745–754). IEEE Computer Society. abs/1205.6233.

  • Zahn, C. (1964). Approximating symmetric relations by equivalence relations. SIAM Journal on Applied Mathematics, 12, 840–847.

Acknowledgments

This work is supported by REQUEST and Open Food System projects.

Author information

Corresponding author

Correspondence to Patricia Conde-Céspedes.

Appendix

Theorem 1

(The density of clusters obtained by maximizing the Zahn-Condorcet criterion is at least 50 %). Given a connected, undirected and unweighted graph \(G=(V,E)\), the optimal partition obtained by optimizing the Zahn-Condorcet criterion has the following property: the number of within-cluster edges of each cluster is at least half of the maximum possible number of within-cluster edges, that is to say, the number of edges the cluster would have if it were a clique. Furthermore, every node in each cluster is connected to at least half of the other nodes in the cluster.

Proof

Considering the reflexivity and symmetry constraints on the relational variable \(x_{ii'}\) (i.e. \(x_{ii}=1\ \forall i\) and \(x_{ii'}=x_{i'i}\)), the expression of the Zahn-Condorcet criterion in Table 2 can be written as follows:

\(F_{ZC}(X)=\sum _{i>i'}(a_{ii'}-\bar{a}_{ii'})x_{ii'}+N^2-2M-N\).

where:

  • \(\sum _{i>i'}a_{ii'}x_{ii'}\) is the number of within-cluster edges for all clusters.

  • \(\sum _{i>i'}\bar{a}_{ii'}x_{ii'}\) is the number of missing within-cluster edges for all clusters.

If we denote by \(E_j\) the number of within-cluster edges of cluster j and by \(n_j\) its number of nodes, the total number of missing edges for cluster j is \(\left( \frac{n_j(n_j-1)}{2}-E_j\right) \). The Zahn-Condorcet criterion then becomes:

\(F_{ZC}(\mathscr {C})=\sum _{j=1}^\kappa \left( E_j-\left( \frac{n_j(n_j-1)}{2}-E_j\right) \right) +N^2-2M-N \),

or

\(F_{ZC}(\mathscr {C})=\sum _{j=1}^\kappa (2E_j-\frac{n_j(n_j-1)}{2})+N^2-2M-N \).

The term \((2E_j-\frac{n_j(n_j-1)}{2})\) represents the contribution of cluster j to the value of the criterion. For each cluster of the optimal partition this term must be non-negative. Otherwise, it would be possible to obtain a better partition by isolating each node of cluster j, since a cluster made of a single isolated node contributes zero to the value of the criterion. This implies:

\((2E_j-\frac{n_j(n_j-1)}{2})\ge 0\), or \(E_j\ge \frac{n_j(n_j-1)}{4}\).

So, each cluster j has a density of at least 50 %.
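The link between the sign of the contribution and the 50 % density threshold can be checked numerically. The following Python sketch is only an illustration under stated assumptions: the helper names (`within_edges`, `zc_contribution`, `density`) and the two toy graphs are hypothetical and do not come from the chapter, and a single-node cluster is given density 1.0 by convention.

```python
def within_edges(edges, cluster):
    """Count the edges whose two endpoints both lie in `cluster` (E_j)."""
    members = set(cluster)
    return sum(1 for u, v in edges if u in members and v in members)


def zc_contribution(n_j, e_j):
    """Contribution 2*E_j - n_j*(n_j-1)/2 of a cluster with n_j nodes."""
    return 2 * e_j - n_j * (n_j - 1) / 2


def density(n_j, e_j):
    """Fraction of the n_j*(n_j-1)/2 possible within-cluster edges present."""
    max_edges = n_j * (n_j - 1) / 2
    # Assumption: a singleton has no possible edge; report density 1.0.
    return e_j / max_edges if max_edges > 0 else 1.0


# Hypothetical toy graphs: a path on 4 nodes and a star on 5 nodes,
# each taken as a single candidate cluster.
path = [(0, 1), (1, 2), (2, 3)]
star = [(0, 1), (0, 2), (0, 3), (0, 4)]

for name, edges, cluster in [("path", path, range(4)), ("star", star, range(5))]:
    n_j = len(cluster)
    e_j = within_edges(edges, cluster)
    print(name, zc_contribution(n_j, e_j), density(n_j, e_j))
```

The path has 3 of 6 possible edges, so its density is exactly 50 % and its contribution is zero. The star has 4 of 10 possible edges (density 40 %) and a contribution of -2, so splitting it into singletons would increase the criterion.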

This result can be extended to every node of each cluster of the optimal partition. Suppose that a cluster \(C_j\) contains a node \(n_0\) that is connected to fewer than half of the other nodes in the cluster. Let \(E_{j_0}\) denote the number of connections of \(n_0\) to nodes in \(C_j\), so that \(E_{j_0}<\frac{(n_j-1)}{2}\).

It is then possible to obtain a better partition by isolating \(n_0\). Indeed, the total contribution of the two clusters obtained after isolating node \(n_0\) is:

\(2(E_j-E_{j_0})-\frac{(n_j-1)(n_j-2)}{2}\)

This last expression is greater than the contribution of cluster j, given by \((2E_j-\frac{n_j(n_j-1)}{2})\), whenever \(n_0\) is connected to fewer than half of the other nodes in \(C_j\).
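The comparison can be made explicit. As a short worked step using only the quantities defined above, the gain \(\Delta \) obtained by isolating \(n_0\) is the difference between the two contributions:

$$\begin{aligned} \Delta&=\left( 2(E_j-E_{j_0})-\frac{(n_j-1)(n_j-2)}{2}\right) -\left( 2E_j-\frac{n_j(n_j-1)}{2}\right) \\&=-2E_{j_0}+\frac{(n_j-1)\left( n_j-(n_j-2)\right) }{2}\\&=(n_j-1)-2E_{j_0}, \end{aligned}$$

which is strictly positive exactly when \(E_{j_0}<\frac{n_j-1}{2}\). For instance, with \(n_j=5\) and \(E_{j_0}=1\), the gain is \(\Delta =4-2=2>0\).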

This also explains why the partitions obtained by optimizing the Zahn-Condorcet criterion sometimes contain clusters made of isolated nodes. \(\square \)

Copyright information

© 2017 Springer International Publishing Switzerland

About this chapter

Cite this chapter

Conde-Céspedes, P., Marcotorchino, JF., Viennet, E. (2017). Comparison of Linear Modularization Criteria Using the Relational Formalism, an Approach to Easily Identify Resolution Limit. In: Guillet, F., Pinaud, B., Venturini, G. (eds) Advances in Knowledge Discovery and Management. Studies in Computational Intelligence, vol 665. Springer, Cham. https://doi.org/10.1007/978-3-319-45763-5_6

  • DOI: https://doi.org/10.1007/978-3-319-45763-5_6

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-45762-8

  • Online ISBN: 978-3-319-45763-5

  • eBook Packages: Engineering, Engineering (R0)
