A distributed approach for graph mining in massive networks

Talukder, N.; Zaki, M. J.

doi:10.1007/s10618-016-0466-x

Published: 09 June 2016

Volume 30, pages 1024–1052, (2016)

Data Mining and Knowledge Discovery

1663 Accesses
51 Citations

Abstract

We propose a novel distributed algorithm for mining frequent subgraphs from a single, very large, labeled network. Our approach is the first distributed method to mine a massive input graph that is too large to fit in the memory of any individual compute node. The input graph thus has to be partitioned among the nodes, which can lead to potential false negatives. Furthermore, for scalable performance it is crucial to minimize the communication among the compute nodes. Our algorithm, DistGraph, ensures that there are no false negatives, and uses a set of optimizations and efficient collective communication operations to minimize information exchange. To our knowledge DistGraph is the first approach demonstrated to scale to graphs with over a billion vertices and edges. Scalability results on up to 2048 IBM Blue Gene/Q compute nodes, with 16 cores each, show very good speedup.
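The false-negative risk mentioned in the abstract can be illustrated with a toy example. The sketch below is not DistGraph; it is a minimal illustration, assuming a hypothetical six-vertex labeled path graph, a hypothetical two-way vertex partition, and raw edge counts as a simplified stand-in for a proper single-graph support measure. It shows that when each partition counts only the edges it fully contains, edges cut by the partition are never counted, so a pattern that is frequent in the whole graph can fall below the threshold.

```python
from collections import Counter

# Hypothetical toy graph (not from the paper): vertex -> label, undirected edges.
labels = {0: "A", 1: "B", 2: "A", 3: "B", 4: "A", 5: "B"}
edges = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5)]

def edge_pattern_counts(edge_list):
    """Count single-edge patterns, keyed by the sorted pair of endpoint labels."""
    return Counter(tuple(sorted((labels[u], labels[v]))) for u, v in edge_list)

def local_edges(part):
    """Edges with both endpoints inside one partition; cut edges are invisible."""
    return [(u, v) for u, v in edges if u in part and v in part]

# Assumed partition of the vertex set across two compute nodes.
partitions = [{0, 1, 2}, {3, 4, 5}]
min_support = 5  # illustrative threshold

# Support computed on the whole graph.
global_counts = edge_pattern_counts(edges)

# Support obtained by naively summing per-partition counts.
local_counts = sum(
    (edge_pattern_counts(local_edges(p)) for p in partitions), Counter()
)

pattern = ("A", "B")
frequent_globally = global_counts[pattern] >= min_support
frequent_locally = local_counts[pattern] >= min_support
print(pattern, global_counts[pattern], local_counts[pattern])
print("false negative:", frequent_globally and not frequent_locally)
```

Here the edge (2, 3) crosses the partition boundary, so the naive sum of local counts misses it. A distributed miner has to exchange enough boundary information to recover such occurrences, which is why the abstract stresses both correctness and low communication.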

Acknowledgments

This work was supported by NSF Award IIS-1302231. We thank Chris Carothers and Bulent Yener for several discussions on the practical and theoretical aspects of our distributed algorithm.

Author information

Authors and Affiliations

Rensselaer Polytechnic Institute, Troy, NY, 12180, USA
N. Talukder & M. J. Zaki

Corresponding author

Correspondence to M. J. Zaki.

Additional information

Responsible editors: Thomas Gärtner, Mirco Nanni, Andrea Passerini and Celine Robardet.

About this article

Cite this article

Talukder, N., Zaki, M.J. A distributed approach for graph mining in massive networks. Data Min Knowl Disc 30, 1024–1052 (2016). https://doi.org/10.1007/s10618-016-0466-x

Received: 14 November 2015
Accepted: 23 May 2016
Published: 09 June 2016
Issue Date: September 2016
DOI: https://doi.org/10.1007/s10618-016-0466-x
