Combining Graph Exploration and Fragmentation for Scalable RDF Query Processing

Abstract

The flexibility offered by the Resource Description Framework (RDF) has made it a popular standard for representing data with an undefined or variable schema using the concept of triples. Its success has resulted in many large-scale multidisciplinary datasets that have prompted the development of efficient RDF processing systems. Current approaches fall into two groups: the first adopts the relational model and stores the triples in tables; the second builds data structures that model RDF data as a graph. The strategies of the first group scale more easily because they reuse optimization techniques from the relational model, such as indexing and fragmentation. However, these approaches incur significant overhead on the complex queries (e.g., compound SPARQL graph patterns involving filters) found in existing applications. Graph-based systems, on the other hand, use more complex data structures, fail to manage main memory efficiently, and do not scale on hardware with limited resources. In this paper, we propose a novel approach to evaluate queries (Basic Graph Patterns, Wildcards, Aggregations and Sorting) on RDF data. Our approach combines RDF graph exploration with physical fragmentation of triples. We describe our graph-based storage and query evaluation models. Then, we detail the architecture of our system and explain in depth the strategy, based on the Volcano execution model, used to manage main memory at query runtime. We conducted extensive experiments on synthetic and real datasets to evaluate the efficiency of our proposal. We compared our performance with a relational-based approach (Virtuoso), a graph-based approach (gStore), and an intensive-indexing approach (RDF-3X). According to our evaluation, our system offers the best compromise between efficient query processing and scalability.
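To make the reference to the Volcano execution model concrete, the following minimal sketch shows the general open/next/close iterator protocol that the model is known for. It is an illustration only, not the system's implementation: the class names `Scan` and `SelectPredicate`, the helper `run`, and the in-memory list standing in for a fragment are all assumptions. The point is that each operator pulls one triple at a time from its child, so no operator needs to buffer a whole intermediate result.

```python
from typing import List, Optional, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)


class Scan:
    """Hypothetical leaf operator: hands out the triples of one fragment, one per call."""

    def __init__(self, fragment: List[Triple]) -> None:
        self.fragment = fragment  # assumption: a real system would read from disk
        self.pos = 0

    def open(self) -> None:
        self.pos = 0

    def next(self) -> Optional[Triple]:
        if self.pos >= len(self.fragment):
            return None  # fragment exhausted
        triple = self.fragment[self.pos]
        self.pos += 1
        return triple

    def close(self) -> None:
        pass


class SelectPredicate:
    """Hypothetical filter operator: keeps triples with a given predicate, without buffering."""

    def __init__(self, child: Scan, predicate: str) -> None:
        self.child = child
        self.predicate = predicate

    def open(self) -> None:
        self.child.open()

    def next(self) -> Optional[Triple]:
        # Pull from the child until a matching triple appears or input ends.
        while (triple := self.child.next()) is not None:
            if triple[1] == self.predicate:
                return triple
        return None

    def close(self) -> None:
        self.child.close()


def run(root) -> List[Triple]:
    """Drive a plan by repeatedly calling next() on its root operator."""
    root.open()
    results = []
    while (triple := root.next()) is not None:
        results.append(triple)
    root.close()
    return results


fragment = [
    ("ex:alice", "ex:knows", "ex:bob"),
    ("ex:alice", "ex:name", "Alice"),
    ("ex:bob", "ex:knows", "ex:carol"),
]
plan = SelectPredicate(Scan(fragment), "ex:knows")
matches = run(plan)
```

Only the driver `run` collects results here; inside the plan, at most one triple is in flight per operator, which is the property that makes the iterator model attractive when main memory is limited.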

Notes

  1. https://github.com/bio2rdf/bio2rdf-scripts/wiki

  2. http://wiki.dbpedia.org

  3. Similar to SQL queries with wildcard characters.

  4. Subject Predicate Object.

  5. Queries with variable predicates can be answered by query rewriting.

  6. https://hadoop.apache.org

  7. https://hbase.apache.org/

  8. In the rest of this paper, we use the term graph fragment instead of characteristic set to designate the physical split of SPO or OPS.

  9. The set of predicates related to a subject (in the case of an SPO fragment) or to an object (in the case of an OPS fragment).

  10. ϕ is used to denote an empty element.

  11. http://graphdb.ontotext.com/

  12. https://github.com/pkumod/gStore

  13. https://github.com/openlink/virtuoso-opensource

  14. Queries list: https://www.lias-lab.fr/~amesmoudi/papers/ISF2020/Queries.pdf
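Notes 8 and 9 describe a graph fragment as the group of triples that share a characteristic set, that is, the same set of predicates attached to a subject (SPO) or to an object (OPS). The sketch below illustrates that grouping under simple assumptions: the function name `graph_fragments`, its `key_index` parameter, and the sample triples are hypothetical, and the system's actual storage layout may differ.

```python
from collections import defaultdict
from typing import Dict, FrozenSet, Iterable, List, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)


def graph_fragments(
    triples: Iterable[Triple], key_index: int = 0
) -> Dict[FrozenSet[str], List[Triple]]:
    """Group triples by the characteristic set of their subject or object.

    key_index=0 groups by subject (SPO-style); key_index=2 groups by
    object (OPS-style). Hypothetical helper for illustration only.
    """
    triples = list(triples)

    # Pass 1: collect the set of predicates attached to each node.
    predicates = defaultdict(set)
    for triple in triples:
        predicates[triple[key_index]].add(triple[1])

    # Pass 2: place each triple in the fragment keyed by its node's predicate set.
    fragments = defaultdict(list)
    for triple in triples:
        key = frozenset(predicates[triple[key_index]])
        fragments[key].append(triple)
    return dict(fragments)


data = [
    ("ex:alice", "ex:name", "Alice"),
    ("ex:alice", "ex:knows", "ex:bob"),
    ("ex:bob", "ex:name", "Bob"),
    ("ex:bob", "ex:knows", "ex:alice"),
    ("ex:paper1", "ex:title", "RDF"),
]
spo = graph_fragments(data, key_index=0)  # fragments keyed by subject predicate sets
ops = graph_fragments(data, key_index=2)  # fragments keyed by object predicate sets
```

In this toy dataset, `ex:alice` and `ex:bob` carry the same predicates and therefore land in the same SPO fragment, while `ex:paper1` forms a fragment of its own.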

Author information

Corresponding author

Correspondence to Amin Mesmoudi.

Cite this article

Khelil, A., Mesmoudi, A., Galicia, J. et al. Combining Graph Exploration and Fragmentation for Scalable RDF Query Processing. Inf Syst Front 23, 165–183 (2021). https://doi.org/10.1007/s10796-020-09998-z

Keywords

  • RDF
  • Graph exploration
  • Fragmentation
  • Scalability
  • Performance