Abstract
Finding dense subgraphs is an important problem in graph mining and has many practical applications. At the same time, while large real-world networks are known to have many communities that are not well-separated, the majority of the existing work focuses on the problem of finding a single densest subgraph. Hence, it is natural to consider the question of finding the top-k densest subgraphs. One major challenge in addressing this question is how to handle overlaps: eliminating overlaps completely is one option, but this may lead to extracting subgraphs that are less dense than those obtainable by allowing a limited amount of overlap. Furthermore, overlaps are desirable, as in most real-world graphs there are vertices that belong to more than one community, and thus to more than one densest subgraph. In this paper we study the problem of finding top-k overlapping densest subgraphs, and we present a new approach that improves over the existing techniques, both in theory and in practice. First, we reformulate the problem definition in a way that allows us to obtain an algorithm with a constant-factor approximation guarantee. Our approach relies on techniques for solving the max-sum diversification problem, which, however, we need to extend in order to make them applicable to our setting. Second, we evaluate our algorithm on a collection of benchmark datasets and show that it convincingly outperforms the previous methods, both in terms of quality and efficiency.
Notes
Here we use the fact that edges are not weighted, and consequently the queue can be implemented as an array of linked lists of vertices.
The synthetic networks used in our experiments are available at http://research.ics.aalto.fi/dmg/dos_synth.tgz.
Namely, S. Abiteboul, E. Demaine, M. Ester, C. Faloutsos, J. Han, G. Karypis, J. Kleinberg, H. Mannila, K. Mehlhorn, C. Papadimitriou, B. Shneiderman, G. Weikum and P. Yu.
Namely, Oceania, Latin-America, the USA, Europe, the Middle-East and East Asia.
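The array-of-linked-lists queue mentioned in the first note can be sketched as follows. This is a minimal illustration of greedy min-degree peeling on an unweighted graph, not the authors' implementation: the function name `peel_densest`, the vertex numbering `0..n-1`, the density definition `|E(S)| / |S|`, and the use of Python sets in place of linked lists are all assumptions made for the example.

```python
def peel_densest(n, edges):
    """Greedy peeling on an unweighted graph with vertices 0..n-1 (n >= 1).

    Returns (best_density, best_vertex_set), with density = |E(S)| / |S|.
    """
    adj = [set() for _ in range(n)]
    for u, v in edges:
        if u != v:
            adj[u].add(v)
            adj[v].add(u)
    deg = [len(a) for a in adj]
    # buckets[d] holds the alive vertices of current degree d. Degrees are
    # integers below n because edges are unweighted, so an array of buckets
    # suffices; Python sets stand in for the linked lists.
    buckets = [set() for _ in range(n)]
    for v in range(n):
        buckets[deg[v]].add(v)
    alive = [True] * n
    m = sum(deg) // 2
    best_density, best_size = m / n, n
    order = []  # vertices in the order they are peeled
    d = 0
    for remaining in range(n, 1, -1):
        # After one removal the minimum degree drops by at most one.
        d = max(d - 1, 0)
        while not buckets[d]:
            d += 1
        v = buckets[d].pop()
        alive[v] = False
        order.append(v)
        for w in adj[v]:
            if alive[w]:
                buckets[deg[w]].remove(w)
                deg[w] -= 1
                buckets[deg[w]].add(w)
        m -= d  # v had exactly d alive neighbours
        size = remaining - 1
        if m / size > best_density:
            best_density, best_size = m / size, size
    removed = set(order[: n - best_size])
    return best_density, set(range(n)) - removed


# A 4-clique on {0, 1, 2, 3} with a pendant vertex 4 attached to vertex 0.
clique_edges = [(a, b) for a in range(4) for b in range(a + 1, 4)]
result = peel_densest(5, clique_edges + [(0, 4)])
```

Because every degree update moves a vertex by exactly one bucket, the whole peeling runs in time linear in the number of vertices and edges; with weighted edges this bucket structure no longer applies and a priority queue would be needed.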
Responsible editor: Thomas Gärtner, Mirco Nanni, Andrea Passerini and Celine Robardet.
Appendices
Appendix: Proof of Proposition 1
Let us first define \(h(x;\,Y) = \left[ f(x \cup Y) - f(Y)\right] / 2\) and
For proving the proposition, we will need Lemma 1.
Lemma 1
Let \( d \) be a c-relaxed metric. Let X and Y be two disjoint sets. Then
Proof
Let \(y \in Y\) and \(x,\,z \in X.\) By definition,
For a given \(x \in X,\) there are exactly \({\left| X\right| } - 1\) pairs \((x,\,z)\) such that \(x \ne z \in X.\) Consequently, summing over all \(x,\,z \in X\) such that \(x \ne z\) gives us
Summing over \(y \in Y\) proves the lemma. \(\square \)
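The summation step can be traced explicitly. The following is a sketch under assumed notation, none of which is fixed by the text above: \(d(X)\) denotes the sum of \(d\) over unordered pairs of \(X,\) \(d(X,\,Y)\) the sum of \(d(x,\,y)\) over \(x \in X\) and \(y \in Y,\) and a \(c\)-relaxed metric is taken to be symmetric and to satisfy \(d(x,\,z) \le c\,(d(x,\,y) + d(y,\,z)).\) For a fixed \(y \in Y,\) summing over ordered pairs \(x \ne z\) in \(X\) gives

\[
2\,d(X) = \sum _{x \ne z} d(x,\,z) \le c \sum _{x \ne z} \left( d(x,\,y) + d(z,\,y)\right) = 2c\,({\left| X\right| } - 1) \sum _{x \in X} d(x,\,y),
\]

since each element of \(X\) occupies each position of an ordered pair exactly \({\left| X\right| } - 1\) times. Summing over \(y \in Y\) then yields

\[
{\left| Y\right| }\, d(X) \le c\,({\left| X\right| } - 1)\, d(X,\,Y).
\]

With \(c = 1\) this reduces to the familiar inequality for ordinary metrics.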
Proof of Proposition 1
Let \(G_1 \subset \cdots \subset G_k\) be the sets constructed during Greedy. Fix \(1 \le i \le k.\) Then \(G_i\) is the current solution after the ith iteration of Greedy.
Let O be the optimal solution. Write \(A = O \cap G_i,\,C = O \setminus A,\) and \(B = G_i \setminus A.\) Lemma 1 implies that
which in turn implies
Moreover, Lemma 1 implies that
which, together with \({\left| C\right| } = {\left| B\right| } + k - i,\) implies
Combining these two inequalities leads us to
Submodularity and monotonicity imply
Let \(u_i\) be the item added at the \((i + 1)\)th step, that is, \(G_{i + 1} = \left\{ u_i\right\} \cup G_i.\) Then, since \(g(u_i;\,G_i) \ge \alpha g(v;\,G_i)\) for any \(v \in C,\)
Summing over i gives us
Since \(\alpha \le 1\) and \(c \ge 1,\) we have
which completes the proof. \(\square \)
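The greedy procedure analysed above can be sketched in a few lines. This is an illustration only: it assumes the gain has the form \(g(x;\,Y) = h(x;\,Y) + \lambda \sum _{y \in Y} d(x,\,y),\) as in standard max-sum diversification, and it uses an exact argmax, which corresponds to \(\alpha = 1.\) The names `greedy`, `coverage` and `jaccard_distance`, as well as the toy instance, are hypothetical.

```python
def greedy(items, f, d, lam, k):
    """Greedily pick k items by marginal gain g(x; Y).

    Assumed gain (an assumption, not quoted from the proof):
        g(x; Y) = [f(Y + x) - f(Y)] / 2 + lam * sum(d(x, y) for y in Y)
    The argmax is exact here, i.e. alpha = 1.
    """
    chosen = []
    for _ in range(k):
        current = frozenset(chosen)
        base = f(current)

        def gain(x):
            h = (f(current | {x}) - base) / 2.0
            return h + lam * sum(d(x, y) for y in chosen)

        candidates = [x for x in items if x not in current]
        chosen.append(max(candidates, key=gain))
    return chosen


# Toy instance: items are vertex sets, f is coverage (monotone and
# submodular), and d is the Jaccard distance (a metric, so c = 1).
def coverage(sets):
    return len(set().union(*sets))


def jaccard_distance(a, b):
    return 1.0 - len(a & b) / len(a | b)


A, B, C, D = (frozenset(s) for s in ({1, 2, 3, 4}, {3, 4, 5}, {6}, {1, 2}))
picked = greedy([A, B, C, D], coverage, jaccard_distance, lam=1.0, k=2)
```

In this instance the first pick is `A` (largest coverage), and the second is `C`: it adds as much coverage as `B` but lies farther from `A`, so the distance term decides.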
Appendix: Proof of Proposition 4
To prove the proposition, we first need to show that Modify does not decrease the gain of a set significantly.
Lemma 2
Assume a graph \(G = (V,\,E).\) Assume a collection of k distinct subgraphs \(\mathcal {W}\) of G, and let \(U \in \mathcal {W}.\) Assume that \(k < {\left| V\right| }\) and that G contains more than k wedges, i.e., connected subgraphs of size 3. Let \(M = \mathsf{{Modify}} (U,\, G,\, \mathcal {W},\, \lambda ).\) Then \( \chi \mathopen {}\left( M;\,\mathcal {W}\right) \ge 2/5 \times ( \chi \mathopen {}\left( U;\,\mathcal {W}\right) +\lambda ).\)
Proof
Write \(r = {\left| U\right| }\) and \(\alpha = \frac{r}{r + 1}.\) We split the proof into two cases. Case 1: assume that X, as given in Algorithm 3, is not empty. Select \(B \in X.\) We will show that
for any \(W \in \mathcal {W},\) where \(I[U = W] = 1\) if \(U = W,\) and 0 otherwise. This automatically guarantees that
proving the result since \(\alpha \ge 1/2\) and the gain of M is at least as good as the gain of B.
To prove the first inequality, note that
To prove the second inequality fix \(W \in \mathcal {W},\) and let \(p = {\left| W\right| },\,q = {\left| W \cap U\right| }.\) Define
Let v be the only vertex in \(B \setminus U.\) If \(v \notin W,\) then \( D \mathopen {}\left( B,\,W\right) \ge \varDelta .\) Hence, we can assume that \(v \in W.\) This leads to
Let us define \(\beta \) as the fraction of the numerators,
We wish to show that \(\beta \ge 1.\) Since \(p \ge q + 1,\)
The ratio of distances is now
This proves the first case.
Case 2: assume that \(X = \emptyset .\) Then we must have \(Y \ne \emptyset \) and \(r \ge 2,\) as otherwise \({\left| \mathcal {W}\right| } \ge {\left| V\right| },\) which violates the assumption of the lemma.
Assume that \( \mathrm {dens} \mathopen {}\left( U\right) \ge 5/3.\) Let \(B \in Y.\) Removing a single vertex from U decreases the density by at most 1. This gives us
To bound the distance term, fix \(W \in \mathcal {W},\) and let \(p = {\left| W\right| },\,q = {\left| W \cap U\right| }.\) Let v be the only vertex in \(U \setminus B.\) Define \(\varDelta = D \mathopen {}\left( U,\,W\right) + I[U = W].\) If \(v \in W,\) then we can easily show that \( D \mathopen {}\left( B,\,W\right) \ge \varDelta .\) Hence, assume that \(v \notin W.\) This implies that \(q \le \min (p,\,r - 1),\) and hence \(q^2 \le p(r - 1).\) As before, we can express the distance term as
and
The ratio is then
where the first inequality follows from the fact that the ratio is decreasing as a function of q.
Assume that \( \mathrm {dens} \mathopen {}\left( U\right) < 5/3.\) By assumption, there is a wedge B outside \(\mathcal {W}.\) Since \( \mathrm {dens} \mathopen {}\left( B\right) \ge 2/3,\) we have \( \mathrm {dens} \mathopen {}\left( B\right) / \mathrm {dens} \mathopen {}\left( U\right) \ge 2 / 5.\) The distance terms decrease by a factor of 1/2, since
Combining the inequalities proves that
which proves the lemma. \(\square \)
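The density constants used in Case 2 can be checked with a short calculation. This is a sketch that assumes \( \mathrm {dens}(U) = {\left| E(U)\right| } / {\left| U\right| },\) a definition that is not restated in this appendix. If \(v \in U\) has degree \(\deg (v;\,U) \le r - 1\) within U and \(B = U \setminus \left\{ v\right\} ,\) then

\[
\mathrm {dens}(U) - \mathrm {dens}(B) = \frac{{\left| E(U)\right| }}{r} - \frac{{\left| E(U)\right| } - \deg (v;\,U)}{r - 1} = \frac{r \deg (v;\,U) - {\left| E(U)\right| }}{r(r - 1)} \le \frac{\deg (v;\,U)}{r - 1} \le 1.
\]

When \( \mathrm {dens}(U) \ge 5/3,\) we have \(1 \le \frac{3}{5}\, \mathrm {dens}(U),\) and therefore

\[
\mathrm {dens}(B) \ge \mathrm {dens}(U) - 1 \ge \frac{2}{5}\, \mathrm {dens}(U).
\]

When \( \mathrm {dens}(U) < 5/3,\) a wedge B has three vertices and at least two edges, so

\[
\mathrm {dens}(B) \ge \frac{2}{3} = \frac{2}{5} \cdot \frac{5}{3} > \frac{2}{5}\, \mathrm {dens}(U).
\]

Under this assumption, the threshold 5/3 is exactly the point at which both branches give the same factor 2/5.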
Proof of Proposition 4
To prove the proposition, we will first form a new graph H, and show that the density of a subgraph in H is closely related to the gain. This then allows us to prove the statement.
Let us first construct the graph \(H\): given a vertex v, let us define
Let \(H = (V,\, E^{\prime },\, c)\) be a fully connected weighted graph with self-loops where the weight of an edge \(c(v,\, w)\) is
for \(v \ne w,\) and \(c(v,\, v) = s(v).\)
Next, we connect the gain of a set of vertices U (w.r.t. G) with the weighted density of U in H. Given an arbitrary set of vertices U, we will write c(U) to mean the total weight of the edges in H with both endpoints in U. Each \(c(v,\, w),\) for \(v \ne w,\) participates in \(\deg _{H}(v;\, U)\) and \(\deg _{H}(w;\, U),\) and each \(c(v,\, v) = s(v)\) participates (once) in \(\deg _{H}(v;\, U).\) This leads to
We can express the (weighted) degree of a vertex in H as
Write \(k = {\left| \mathcal {W}\right| }.\) These equalities lead to the following identity,
where \(\epsilon (U,\, \mathcal {W})\) is a correction term, equal to 2\(\lambda \) if \(U \in \mathcal {W},\) and 0 otherwise.
Let O be the densest subgraph in H. Next we show that during the for-loop Peel finds a graph whose density is close to \( \mathrm {dens} \mathopen {}\left( O;\, H\right) .\) Let o be the first vertex in O deleted by Peel. We must have
as otherwise we can delete o from O and obtain a better solution. Let \(R = V_i\) be the graph at the moment when o is about to be removed. Let us compare \(\deg _H(o;\,O)\) and \(\deg _H(o;\, R).\) We can lower-bound the second term on the right-hand side of Eq. (3) by \({-}4k\lambda - s(v).\) Since \(O \subseteq R,\) this gives us
To upper-bound the first two terms, note that by definition of Peel, the vertex o has the smallest \(\deg _H(o;\, R) + s(o)\) among all the vertices in R. Hence,
To complete the proof, let \(O^{\prime }\) be the graph outside \(\mathcal {W}\) that maximizes the gain. Due to Eq. (4), we have
Let S be the set returned by Peel.
If \(R \notin \mathcal {W},\) then \(\epsilon (R;\, \mathcal {W}) = 0.\) Moreover, R is not modified, and is one of the graphs that is tested for gain. Consequently, \( \chi \mathopen {}\left( S;\, \mathcal {W}\right) \ge \chi \mathopen {}\left( R;\, \mathcal {W}\right) ,\) proving the statement.
If \(R \in \mathcal {W},\) then it is modified by Modify to, say, \(R^{\prime }.\) Lemma 2 implies that \(5/2 \times \chi \mathopen {}\left( R^{\prime };\, \mathcal {W}\right) \ge \chi \mathopen {}\left( R;\, \mathcal {W}\right) + \epsilon (R;\, \mathcal {W}).\) Since \( \chi \mathopen {}\left( S;\, \mathcal {W}\right) \ge \chi \mathopen {}\left( R^{\prime };\, \mathcal {W}\right) ,\) this completes the proof. \(\square \)
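The peeling argument on H can be illustrated with a small sketch. It is not the Peel algorithm of the paper: the Modify step and the gain test are omitted, the weights `c` and `s` are supplied by the caller rather than derived from G and \(\mathcal {W},\) and the weighted density is assumed to be \(c(R) / {\left| R\right| }.\) The function name `peel_weighted` and the toy weights are hypothetical. The sketch removes, at each step, the vertex with the smallest \(\deg _H(v;\, R) + s(v),\) and computes \(c(R)\) from the degrees by counting each non-loop edge twice and each self-loop once.

```python
def peel_weighted(vertices, c, s):
    """Peel a complete weighted graph with self-loops.

    c(v, w): symmetric weight for v != w; s(v): self-loop weight.
    Returns (best_density, best_set), with density = c(R) / |R| (assumed).
    """
    R = set(vertices)
    # deg[v] = deg_H(v; R): non-loop weights inside R plus the loop, once.
    deg = {v: s(v) + sum(c(v, w) for w in R if w != v) for v in R}

    def total(current):
        # c(R) = 1/2 * sum_v (deg_H(v; R) + s(v)): the degrees count every
        # non-loop edge twice and every loop once, so adding s(v) makes
        # both kinds of edges count twice before halving.
        return 0.5 * sum(deg[v] + s(v) for v in current)

    best_density, best_set = total(R) / len(R), set(R)
    while len(R) > 1:
        # Remove the vertex with the smallest deg_H(v; R) + s(v).
        o = min(R, key=lambda v: deg[v] + s(v))
        R.remove(o)
        for w in R:
            deg[w] -= c(o, w)
        density = total(R) / len(R)
        if density > best_density:
            best_density, best_set = density, set(R)
    return best_density, best_set


# Toy weights: a heavy triangle on {0, 1, 2}; vertex 3 is attached with
# negative weights; every self-loop has weight 0.5.
def toy_c(v, w):
    return -1.0 if 3 in (v, w) else 3.0


def toy_s(v):
    return 0.5


peeled = peel_weighted(range(4), toy_c, toy_s)
```

Here vertex 3 is peeled first, after which the triangle has total weight 10.5 and density 3.5, which is the best value seen during the peeling.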
Cite this article
Galbrun, E., Gionis, A. & Tatti, N. Top-k overlapping densest subgraphs. Data Min Knowl Disc 30, 1134–1165 (2016). https://doi.org/10.1007/s10618-016-0464-z
DOI: https://doi.org/10.1007/s10618-016-0464-z