Management and Analysis of Big Graph Data: Current Systems and Open Challenges

Junghanns, Martin; Petermann, André; Neumann, Martin; Rahm, Erhard

doi:10.1007/978-3-319-49340-4_14

Chapter
First Online: 26 February 2017

Abstract

Many big data applications in business and science require the management and analysis of huge amounts of graph data. Systems suited to managing and analyzing such data should meet a number of challenging requirements, including support for an expressive graph data model with heterogeneous vertices and edges, powerful query and graph mining capabilities, ease of use, and high performance and scalability. In this chapter, we survey current system approaches for the management and analysis of “big graph data”. We discuss graph database systems, distributed graph processing systems such as Google Pregel and its variations, and graph dataflow approaches based on Apache Spark and Flink. We further outline a recent research framework called Gradoop that is built on the so-called Extended Property Graph Data Model, with dedicated support for analyzing not only single graphs but also collections of graphs. Finally, we discuss current and future research challenges.

Notes

1. http://lod-cloud.net/.
2. http://tinkerpop.apache.org/.
3. http://wiki.blazegraph.com/wiki/index.php/RDF_GAS_API.
4. http://db-engines.com/en/ranking/graph+dbms.
5. https://www.w3.org/TR/rdf-schema/#ch_reificationvocab.
6. http://www.opencypher.org/.
7. We use vertex compute function and vertex function interchangeably throughout this section.
8. At its core, Flink is a distributed streaming system that provides both streaming and batch APIs. We focus on the batch API, as Gelly is currently implemented on top of it.
9. Flink supports further systems as data sources and sinks, e.g., relational and NoSQL databases or queuing systems.
10. When implemented using a synchronous graph-processing system.
11. The coGroup transformation groups each input dataset on one or more fields and then joins the groups.
12. GSA is a variant of the GAS abstraction introduced by PowerGraph [41] and discussed in Sect. 3.
13. The Neighbor class allows access to the incident edge value and the adjacent vertex value.
14. An operator fulfills the closure property if executing it on members of an input domain yields members of the same domain.
15. http://www.gradoop.com.
16. http://hbase.apache.org.
17. The betweenness centrality of a vertex is defined as the number of shortest paths in a network passing through the vertex. A high value thus indicates that a vertex is centrally located and plays an important role in the network.
18. www.mpi-inf.mpg.de/yago-naga/yago/.
19. http://dbpedia.org/.
20. www.wikidata.org.
21. http://neo4j.com/graph-visualization-neo4j/.
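
The coGroup transformation described in note 11 can be illustrated with a minimal, framework-free sketch. It is not Flink code: the function name `co_group`, the key extractors, and the sample vertex and edge tuples are hypothetical and chosen only for illustration. The sketch also assumes that a key present in only one input is paired with an empty group on the other side.

```python
from collections import defaultdict


def co_group(left, right, left_key, right_key):
    """Group both inputs by key, then pair up the groups that share a key."""
    left_groups = defaultdict(list)
    right_groups = defaultdict(list)

    # Step 1: group each input dataset on its key field.
    for record in left:
        left_groups[left_key(record)].append(record)
    for record in right:
        right_groups[right_key(record)].append(record)

    # Step 2: join the groups key by key. A key missing on one side
    # contributes an empty list (an assumption of this sketch).
    result = {}
    for key in sorted(set(left_groups) | set(right_groups)):
        result[key] = (left_groups.get(key, []), right_groups.get(key, []))
    return result


# Hypothetical sample data: (vertex id, label) and (source id, target id).
vertices = [(1, "Alice"), (2, "Bob"), (3, "Carol")]
edges = [(1, 2), (1, 3), (2, 3)]

# Pair every vertex with its outgoing edges.
grouped = co_group(vertices, edges, lambda v: v[0], lambda e: e[0])
for vertex_id, (vertex_group, edge_group) in grouped.items():
    print(vertex_id, vertex_group, edge_group)
```

Unlike a plain join, which would emit one record per matching vertex/edge pair, this form hands each key's complete groups to the caller at once, so a per-vertex computation sees all of its edges together.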

References

C. Aggarwal, K. Subbian, Evolutionary network analysis: a survey. ACM Comput. Surv. (CSUR) 47(1), 10 (2014)
G.A. Agha, Actors: a model of concurrent computation in distributed systems. Technical report, DTIC Document (1985)
Akka. http://www.akka.io. Accessed 10 Mar 2016
A. Alexandrov et al., The stratosphere platform for big data analytics. VLDB J. 23(6) (2014)
AllegroGraph. http://franz.com/agraph/allegrograph/. Accessed 10 Mar 2016
R. Angles, A comparison of current graph database models, in Proceedings of ICDEW (2012)
R. Angles, C. Gutierrez, Survey of graph database models. ACM Comput. Surv. (CSUR) 40(1) (2008)
R. Angles et al., The linked data benchmark council: a graph and RDF industry benchmarking effort. Proc. SIGMOD 43(1) (2014)
Apache Flink Iteration Operators. https://ci.apache.org/projects/flink/flink-docs-master/apis/batch/index.html#iteration-operators. Accessed 09 Mar 2016
Apache Giraph. http://www.giraph.apache.org. Accessed 10 Mar 2016
Apache Jena - TDB. https://jena.apache.org/documentation/tdb/. Accessed 09 Mar 2016
T.G. Armstrong et al., LinkBench: a database benchmark based on the Facebook social graph (2013)
G. Bagan et al. gMark: Controlling Diversity in Benchmarking Graph Databases. CoRR abs/1511.08386 (2015)
O. Batarfi et al., Large scale graph processing systems: survey and an experimental evaluation. Clust. Comput. 18(3) (2015)
K. Bellare et al., Woo: a scalable and multi-tenant platform for continuous knowledge base synthesis. PVLDB 6(11) (2013)
D.P. Bertsekas, J.N. Tsitsiklis, Parallel and distributed computation: numerical methods, vol. 23 (1989)
Big Data Spatial and Graph User’s Guide and Reference. http://docs.oracle.com/cd/E69290_01/doc.44/e67958/toc.htm. Accessed 16 Mar 2016
H. Bolouri, Modeling genomic regulatory networks with big data. Trends Genet. 30(5) (2014)
D. Brickley, L. Miller, Foaf vocabulary specification 0.98. Namespace document 9 (2012)
A. Buluç et al., Recent advances in graph partitioning. CoRR (2013)
M. Canim, Y.C. Chang, System G data store: big, rich graph data analytics in the cloud, in IEEE Cloud Engineering (IC2E) (March 2013)
G. Carothers, RDF 1.1 N-Quads: a line-based syntax for RDF datasets. W3C Recommendation (2014)
R. Cattell, Scalable SQL and NoSQL data stores. Proc. SIGMOD 39(4) (2011)
C. Chen et al., Graph OLAP: towards online analytical processing on graphs, in IEEE Data Mining (ICDM) (2008)
R. Cheng et al., Kineograph: taking the pulse of a fast-changing and connected world, in Proceedings of EuroSys (2012)
Cypher Query Language. http://neo4j.com/docs/stable/cypher-query-lang.html. Accessed 16 Mar 2016
S. Das et al., A Tale of two graphs: property graphs as RDF in Oracle, in EDBT (2014)
R. Diestel, Graph theory, Graduate Texts in Mathematics, vol. 173, 4th edn. (2012)
Y. Ding, Scientific collaboration and endorsement: network analysis of coauthorship and citation networks. J. Inform. 5(1) (2011)
X. Dong et al., Knowledge Vault: a web-scale approach to probabilistic knowledge fusion, in Proceedings of SIGKDD (2014)
B. Elser, A. Montresor, An evaluation study of bigdata frameworks for graph processing, in IEEE Big Data (2013)
O. Erling, I. Mikhailov, RDF support in the Virtuoso DBMS, in Networked Knowledge-Networked Media (2009)
O. Erling et al., The LDBC social network benchmark: interactive workload, in Proceedings of SIGMOD (2015)
S. Ewen et al., Spinning fast iterative data flows. PVLDB 5(11) (2012)
S. Ewen et al., Iterative parallel data processing with stratosphere: an inside look, in Proceedings of SIGMOD (2013)
S. Fortunato, Community detection in graphs. Phys. Rep. 486(3–5) (2010)
B. Gallagher, Matching structure and semantics: a survey on graph-based pattern matching. AAAI FS 6 (2006)
J. Gao et al., Glog: a high level graph analysis system using mapreduce, in Proceedings of ICDE (2014)
Gelly: Flink Graph API. https://ci.apache.org/projects/flink/flink-docs-master/apis/batch/libs/gelly.html. Accessed 15 Mar 2016
A. Ghrab et al., A framework for building OLAP cubes on graphs, in Advances in Databases and Information Systems (2015)
J.E. Gonzalez et al., Powergraph: distributed graph-parallel computation on natural graphs, in Proceedings of OSDI (2012)
J.E. Gonzalez et al., GraphX: graph processing in a distributed dataflow framework, in Proceedings of OSDI (2014)
GraphDB: At Last, the Meaningful Database. http://ontotext.com/documents/reports/PW_Ontotext.pdf. Whitepaper July 2014
Y. Guo et al., How well do graph-processing platforms perform? An empirical performance evaluation and analysis, in Proceedings of Parallel and Distributed Processing Symposium (2014)
D. Haas et al., Wisteria: nurturing scalable data cleaning infrastructure. PVLDB 8(12) (2015)
T. Haerder, A. Reuter, Principles of transaction-oriented database recovery. ACM Comput. Surv. 15(4) (1983)
M. Han et al., An experimental comparison of pregel-like graph processing systems. PVLDB 7(12) (2014)
S. Harris, A. Seaborne, E. Prudhommeaux, SPARQL 1.1 query language. W3C Recommendation 21 (2013)
O. Hartig, B. Thompson, Foundations of an alternative approach to reification in RDF. Technical Report. arXiv:1406.3399 (2014)
T. Hayashi, T. Akiba, Y. Yoshida, Fully dynamic betweenness centrality maintenance on massive networks. PVLDB 9(2) (2015)
J. Huang, D.J. Abadi, LEOPARD: lightweight edge-oriented partitioning and replication for dynamic graphs. PVLDB 9(7) (2016)
InfiniteGraph: The Distributed Graph Database. http://www.objectivity.com/wp-content/uploads/Objectivity_WP_IG_Distr_Benchmark.pdf. Whitepaper 2012
B. Iordanov, HyperGraphDB: a generalized graph database, in Web-Age Information Management (2010)
N. Jain, G. Liao, T.L. Willke, Graphbuilder: scalable graph ETL framework, in International Workshop on Graph Data Management Experiences and Systems (2013)
C. Jiang et al., A survey of Frequent Subgraph Mining algorithms. Knowl. Eng. Rev. 28(1) (2013)
M. Junghanns et al., GRADOOP: Scalable Graph Data Management and Analytics with Hadoop. Technical Report. arXiv:1506.00548 (2015)
M. Junghanns et al., Analyzing extended property graphs with apache flink, in Proceedings of SIGMOD Workshop on Network Data Analytics (2016)
Z. Kaoudi, I. Manolescu, RDF in the clouds: a survey. VLDB J. 24(1) (2015)
G. Karypis, V. Kumar, Multilevel k-way partitioning scheme for irregular graphs. J. Parallel Distrib. Comput. 48(1) (1998)
Key Features - ArangoDB. https://www.arangodb.com/key-features/. Accessed 10 Mar 2016
Z. Khayyat et al., Mizan: a system for dynamic load balancing in large-scale graph processing, in Proceedings EuroSys (2013)
Z. Khayyat et al., Bigdansing: a system for big data cleansing, in Proceedings SIGMOD (2015)
G. Klyne, J.J. Carroll, Resource description framework (RDF): concepts and abstract syntax (2006)
L. Kolb, A. Thor, E. Rahm, Dedoop: efficient deduplication with Hadoop. PVLDB 5(12) (2012)
L. Kolb, Z. Sehili, E. Rahm, Iterative computation of connected graph components with MapReduce. Datenbank-Spektrum 14(2) (2014)
D. Koller, N. Friedman, Probabilistic graphical models: principles and techniques (2009)
A. Kyrola, G. Blelloch, C. Guestrin, GraphChi: large-scale graph computation on just a PC, in Proceedings OSDI (2012)
J. Lin, M. Schatz, Design patterns for efficient graph algorithms in MapReduce, in Proceedings of 8th Workshop on Mining and Learning with Graphs (2010)
Y. Low et al., Distributed GraphLab: a framework for machine learning and data mining in the cloud. PVLDB 5(8) (2012)
Y. Lu, J. Cheng, D. Yan, H. Wu, Large-scale distributed graph computing systems: an experimental evaluation. PVLDB 8(3) (2014)
G. Malewicz et al., Pregel: a system for large-scale graph processing, in Proceedings of SIGMOD (2010)
MarkLogic Semantics. http://www.marklogic.com/resources/marklogic-semantics-datasheet/. Datasheet March 2016
N. Martinez-Bazan, S. Gomez-Villamor, F. Escale-Claveras, DEX: a high-performance graph database management system, in Proceedings of ICDEW (2011)
R. McColl et al., A performance evaluation of open source graph databases, in Proceedings of PPAAW (2014)
R.R. McCune, T. Weninger, G. Madey, Thinking like a vertex: a survey of vertex-centric frameworks for large-scale distributed graph processing. ACM Comput. Surv. (CSUR) 48(2) (2015)
F. McSherry et al., Composable incremental and iterative data-parallel computation with naiad. Technical Report MSR-TR-2012-105 (October 2012)
J.J. Miller, Graph database applications and concepts with Neo4j, in Proceedings of Southern Association for Information Systems Conference, vol. 2324 (2013)
J. Mondal, A. Deshpande, Managing large dynamic graphs efficiently, in Proceedings of SIGMOD (2012)
D.G. Murray et al., Naiad: a timely dataflow system, in Proceedings of 24th ACM Symposium on Operating Systems Principles. SOSP ’13 (2013)
R. Nehme, N. Bruno, Automated partitioning design in parallel database systems, in Proceedings of SIGMOD (2011)
M. Nickel, K. Murphy, V. Tresp, E. Gabrilovich, A review of relational machine learning for knowledge graphs. Proc. IEEE 104(1) (2016)
Oracle Spatial and Graph: Advanced Data Management. http://www.oracle.com/technetwork/database/options/spatialandgraph/spatial-and-graph-wp-12c-1896143.pdf. Whitepaper September 2014
A. Petermann et al., BIIIG: enabling business intelligence with integrated instance graphs, in Proceedings of ICDEW (2014)
A. Petermann et al., FoodBroker-generating synthetic datasets for graph-based business analytics, in Big Data Benchmarking (2014)
A. Petermann et al., Graph-based data integration and business intelligence with BIIIG. PVLDB 7(13) (2014)
A. Poulovassilis, M. Levene, A nested-graph model for the representation and manipulation of complex objects. ACM Trans. Inform. Syst. (TOIS) 12(1) (1994)
quasar. http://www.paralleluniverse.co/quasar. Accessed 10 Mar 2016
U.N. Raghavan et al., Near linear time algorithm to detect community structures in large-scale networks. Phys. Rev. E 76, 036106 (2007)
F. Rahimian et al., Distributed vertex-cut partitioning, in Distributed Applications and Interoperable Systems (2014)
E. Rahm, The case for holistic data integration, in Advances in Databases and Information Systems (2016)
J. Rao et al., Automating physical database design in a parallel database, in Proceedings of SIGMOD (2002)
M.A. Rodriguez, The gremlin graph traversal machine and language (invited talk), in Proceedings of 15th Symposium on Database Programming Languages (2015)
M.A. Rodriguez, P. Neubauer, Constructions from dots and lines. Bull. Am. Soc. Inform. Sci. Technol. 36(6) (2010)
A. Roy et al., Chaos: scale-out graph processing from secondary storage, in Proceedings of 25th Symposium on Operating Systems Principles (2015)
M. Rudolf et al., The graph story of the SAP HANA database, in Proceedings of BTW (2013)
S. Sakr, A. Liu, A.G. Fayoumi, The family of mapreduce and large-scale data processing systems. ACM Comput. Surv. (CSUR) 46(1) (2013)
S. Salihoglu, J. Widom, GPS: a graph processing system, in Proceedings of 25th International Conference on Scientific and Statistical Database Management. SSDBM (2013)
N. Satish et al., Navigating the maze of graph analytics frameworks using massive graph datasets, in Proceedings of SIGMOD (2014)
K. Shim, MapReduce algorithms for big data analysis. PVLDB 5(12) (2012)
I. Stanton, G. Kliot, Streaming graph partitioning for large distributed graphs, in Proceedings of SIGKDD
Stardog 4 - The Manual. http://docs.stardog.com/. Accessed 10 Mar 2016
P. Stutz, A. Bernstein, W. Cohen, Signal/collect: graph algorithms for the (semantic) web, in ISWC (2010)
W. Sun et al., SQLGraph: an efficient relational-based property graph store, in Proceedings of SIGMOD (2015)
C. Teixeira et al., Arabesque: a system for distributed graph mining, in Proceedings of 25th Symposium on Operating Systems Principles (2015)
The bigdata RDF Database. https://www.blazegraph.com/whitepapers/bigdata_architecture_whitepaper.pdf. Whitepaper May 2013
Y. Tian, R.A. Hankins, J.M. Patel, Efficient aggregation for graph summarization, in Proceedings of SIGMOD (2008)
Y. Tian et al., From “Think Like a Vertex” to “Think Like a Graph”. PVLDB 7(3) (2013)
TITAN: Distributed Graph Database. http://thinkaurelius.github.io/titan/. Accessed 10 Mar 2016
N.B. Turk-Browne, Functional interactions as big data in the human brain. Science 342(6158) (2013)
L.G. Valiant, A bridging model for parallel computation. CACM 33(8) (1990)
X.H. Wang et al., Ontology based context modeling and reasoning using owl, in Pervasive Computing and Communications Workshops (2004)
Z. Wang et al., Pagrol: parallel graph olap over large-scale attributed graphs, in Proceedings of ICDE (2014)
Why OrientDB? http://orientdb.com/why-orientdb/. Accessed 10 Mar 2016
Y. Xia et al., Graph analytics and storage, in IEEE Big Data (2014)
R.S. Xin et al., GraphX: a resilient distributed graph system on spark, in First International Workshop on Graph Data Management Experiences and Systems. GRADES ’13 (2013)
R.S. Xin et al., GraphX: Unifying Data-Parallel and Graph-Parallel Analytics. Technical Report. arXiv:1402.2394 (2014)
P. Yuan et al., Triplebit: a fast and compact system for large scale rdf data. PVLDB 6(7) (2013)
M. Zaharia et al., Spark: cluster computing with working sets, in Proceedings of 2nd USENIX Conference on Hot Topics in Cloud Computing. HotCloud’10 (2010)
N. Zhang, Y. Tian, J.M. Patel, Discovery-driven graph summarization, in Proceedings of ICDE (2010)
P. Zhao et al., Graph cube: on warehousing and OLAP multidimensional networks, in Proceedings of SIGMOD (2011)
Y. Zhao et al., Evaluation and analysis of distributed graph-parallel processing frameworks. J. Cyber Secur. Mobil. 3(3) (2014)

Acknowledgements

This work is partially funded by the German Federal Ministry of Education and Research under project ScaDS Dresden/Leipzig (BMBF 01IS14014B).

Author information

Authors and Affiliations

Database Research Group, Leipzig University, Leipzig, Germany
Martin Junghanns, André Petermann & Erhard Rahm
Swedish Institute of Computer Science, Kista, Sweden
Martin Neumann

Corresponding author

Correspondence to Martin Junghanns.

Editor information

Editors and Affiliations

School of Information Technologies, The University of Sydney, Sydney, New South Wales, Australia
Albert Y. Zomaya
The School of Computer Science, The University of New South Wales, Eveleigh, New South Wales, Australia
Sherif Sakr

About this chapter

Cite this chapter

Junghanns, M., Petermann, A., Neumann, M., Rahm, E. (2017). Management and Analysis of Big Graph Data: Current Systems and Open Challenges. In: Zomaya, A., Sakr, S. (eds) Handbook of Big Data Technologies. Springer, Cham. https://doi.org/10.1007/978-3-319-49340-4_14

DOI: https://doi.org/10.1007/978-3-319-49340-4_14
Published: 26 February 2017
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-49339-8
Online ISBN: 978-3-319-49340-4
eBook Packages: Computer Science, Computer Science (R0)
