Data Mining and Knowledge Discovery, Volume 30, Issue 5, pp 1024–1052

A distributed approach for graph mining in massive networks



We propose a novel distributed algorithm for mining frequent subgraphs from a single, very large, labeled network. Our approach is the first distributed method to mine a massive input graph that is too large to fit in the memory of any individual compute node. The input graph therefore has to be partitioned among the nodes, which can lead to false negatives. Furthermore, for scalable performance it is crucial to minimize communication among the compute nodes. Our algorithm, DistGraph, ensures that there are no false negatives, and uses a set of optimizations and efficient collective communication operations to minimize information exchange. To our knowledge, DistGraph is the first approach demonstrated to scale to graphs with over a billion vertices and edges. Scalability results on up to 2048 IBM Blue Gene/Q compute nodes, with 16 cores each, show very good speedup.
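To make the false-negative problem concrete, the following toy sketch shows how naive per-partition mining can miss a pattern that is frequent in the whole graph, and how combining local counts before thresholding avoids this. It is only an illustration under stated assumptions, not DistGraph itself: the graph, the vertex-to-partition map `owner`, the threshold `minsup`, and the use of a plain edge count as a stand-in for a support measure are all hypothetical. The summation loop plays the role that a collective sum across compute nodes would play in a real distributed run.

```python
from collections import Counter

def edge_pattern_counts(edges, labels):
    """Count single-edge patterns, keyed by the sorted pair of endpoint labels."""
    counts = Counter()
    for u, v in edges:
        pattern = tuple(sorted((labels[u], labels[v])))
        counts[pattern] += 1
    return counts

def frequent(counts, minsup):
    """Return the patterns whose count reaches the support threshold."""
    return {p for p, c in counts.items() if c >= minsup}

# Hypothetical toy graph: six labeled vertices, four edges, two partitions.
labels = {0: "A", 1: "B", 2: "A", 3: "B", 4: "A", 5: "B"}
edges = [(0, 1), (1, 2), (2, 3), (4, 5)]
owner = {0: 0, 1: 0, 2: 0, 3: 1, 4: 1, 5: 1}  # vertex -> partition id
minsup = 3
num_parts = 2

# Reference answer computed on the whole graph (not possible at scale).
global_frequent = frequent(edge_pattern_counts(edges, labels), minsup)

# Naive scheme: each partition sees only its internal edges and applies
# the threshold locally. The cut edge (2, 3) is lost entirely.
naive_frequent = set()
for part in range(num_parts):
    internal = [(u, v) for u, v in edges
                if owner[u] == part and owner[v] == part]
    naive_frequent |= frequent(edge_pattern_counts(internal, labels), minsup)

# Exact scheme: every edge, including cut edges, is counted by exactly one
# partition (here, the owner of its smaller endpoint). Local counts are
# summed first and the threshold is applied to the combined counts.
total = Counter()
for part in range(num_parts):
    assigned = [(u, v) for u, v in edges if owner[min(u, v)] == part]
    total += edge_pattern_counts(assigned, labels)
exact_frequent = frequent(total, minsup)

print("global:", global_frequent)
print("naive :", naive_frequent)
print("exact :", exact_frequent)
```

In this toy case the A-B edge pattern occurs four times globally, but no partition sees more than two internal occurrences, so the naive scheme reports nothing. Larger patterns whose embeddings span partitions raise the same issue in a harder form, which is why cross-partition information exchange is needed at all.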


Parallel graph mining; Distributed graph mining; Single large graph; Frequent subgraph mining; High performance computing



This work was supported by NSF Award IIS-1302231. We thank Chris Carothers and Bulent Yener for several discussions on the practical and theoretical aspects of our distributed algorithm.



Copyright information

© The Author(s) 2016

Authors and Affiliations

  1. Rensselaer Polytechnic Institute, Troy, USA
