Distributed graph cube generation using Spark framework

  • Seok Kang
  • Suan Lee
  • Jinho Kim


Graph OLAP is a technology that generates aggregates or summaries of a large-scale graph based on the properties (or dimensions) associated with its nodes and edges, and in turn enables interactive analysis of the statistical information contained in the graph. To support these OLAP functions efficiently, a graph cube is widely used; it maintains aggregate graphs for all dimensions of the source graph. However, computing the graph cube for a large graph requires an enormous amount of time. While previous approaches have used the MapReduce framework to cut down on this computation time, the more recently developed Spark environment offers superior computational performance. To leverage the advantages of Spark, we propose the GraphNaïve and GraphTDC algorithms. GraphNaïve sequentially computes graph cuboids for all dimensions in a graph, while GraphTDC first creates an execution plan and then computes the cuboids according to it. We also propose the Generate Multi-Dimension Table method, which efficiently creates a multidimensional graph table to express the graph. Evaluation experiments demonstrated that the GraphTDC algorithm significantly outperformed DataFrame, Spark SQL’s built-in library, as the size of the graphs increased.
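To make the notion of a graph cube concrete, the following is a minimal single-machine sketch in plain Python of the naive idea of computing one cuboid per subset of dimensions, one after another. It is not the authors' GraphNaïve or GraphTDC implementation and does not use Spark. The toy graph, the dimension names (`gender`, `city`), the choice to treat edges as directed, and the helper names `cuboid` and `naive_graph_cube` are all assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy graph: node id -> one attribute value per dimension.
DIMENSIONS = ("gender", "city")
NODES = {
    1: {"gender": "F", "city": "Seoul"},
    2: {"gender": "M", "city": "Seoul"},
    3: {"gender": "F", "city": "Busan"},
    4: {"gender": "M", "city": "Busan"},
}
# Edges are treated as directed (source, target) pairs in this sketch.
EDGES = [(1, 2), (1, 3), (2, 4), (3, 4)]


def cuboid(nodes, edges, dims):
    """Aggregate the graph on the given dimensions.

    Nodes that share the same values on `dims` collapse into one
    aggregate node; each edge is counted between the aggregate nodes
    of its two endpoints. Returns (node_counts, edge_counts).
    """
    def key(node_id):
        return tuple(nodes[node_id][d] for d in dims)

    node_counts = Counter(key(n) for n in nodes)
    edge_counts = Counter((key(u), key(v)) for u, v in edges)
    return node_counts, edge_counts


def naive_graph_cube(nodes, edges, dimensions):
    """Compute a cuboid for every subset of dimensions, sequentially."""
    cube = {}
    for size in range(len(dimensions) + 1):
        for dims in combinations(dimensions, size):
            cube[dims] = cuboid(nodes, edges, dims)
    return cube


cube = naive_graph_cube(NODES, EDGES, DIMENSIONS)
```

With two dimensions the cube holds 2^2 = 4 cuboids, keyed by `()`, `("gender",)`, `("city",)`, and `("gender", "city")`. Every cuboid here is recomputed from the source graph, which is the cost that a planned computation order would aim to reduce.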


Distributed parallel processing; Spark framework; Resilient distributed dataset; Graph cube; Data cube; Online analytical processing



This research was supported by Korea Electric Power Corporation (Grant Number: R18XA05) and by the Industrial Technology Innovation Program (Project#: 10052797), through the Korea Evaluation Institute of Industrial Technology (KEIT), funded by the Ministry of Trade, Industry and Energy.



Copyright information

© Springer Science+Business Media, LLC, part of Springer Nature 2019

Authors and Affiliations

  1. Department of Computer Science, Kangwon National University, Chuncheon, Korea
