Document Based RDF Storage Method for Efficient Parallel Query Processing

  • Conference paper

Part of the book series: Communications in Computer and Information Science (CCIS, volume 846)

Abstract

In this paper, we investigate the problem of efficiently evaluating SPARQL queries over large amounts of linked data using a distributed NoSQL system. We propose an efficient approach for partitioning large linked data graphs using a distributed framework (MapReduce), as well as an effective data model for storing linked data in a document database with a maximum replication factor of 2 (i.e., in the worst case, the stored data graph doubles in size). The proposed model and partitioning approach ensure high-performance query evaluation and horizontal scaling for a class of queries called generalized star queries (i.e., queries allowing both subject-object and object-subject edges from a central node), because evaluating these queries requires no join operations over multiple datasets. Furthermore, we present an implementation of our approach using MongoDB and an algorithm, based on the proposed data model, for translating generalized star queries into the MongoDB query language.
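
To make the idea in the abstract concrete, the following is a minimal sketch in plain Python, not the data model or the translation algorithm of the paper (neither is given in this preview). It assumes one document per node that holds the node's outgoing and incoming edges, so each triple is stored at most twice, and it answers a generalized star query by inspecting one document at a time, without joins. The names `build_documents` and `match_star`, the `out`/`in` field layout, and the sample triples are all hypothetical; literals are treated as ordinary nodes and shared variables between edges are not handled.

```python
from collections import defaultdict


def build_documents(triples):
    """Group triples into one document per node (assumed layout).

    Each triple (s, p, o) is copied into the subject's document as an
    outgoing edge and into the object's document as an incoming edge,
    so every triple is stored at most twice.
    """
    docs = defaultdict(lambda: {"out": defaultdict(list), "in": defaultdict(list)})
    for s, p, o in triples:
        docs[s]["out"][p].append(o)  # copy 1: kept with the subject
        docs[o]["in"][p].append(s)   # copy 2: kept with the object
    return docs


def match_star(docs, out_edges, in_edges):
    """Return the central nodes whose own document satisfies the star.

    out_edges and in_edges are lists of (predicate, neighbour) pairs;
    a neighbour of None plays the role of an unconstrained variable.
    Only a single document is examined per candidate, so no join is needed.
    """
    def satisfied(edges, constraints):
        return all(
            p in edges and (n is None or n in edges[p])
            for p, n in constraints
        )

    return sorted(
        node
        for node, doc in docs.items()
        if satisfied(doc["out"], out_edges) and satisfied(doc["in"], in_edges)
    )


# Hypothetical sample data.
triples = [
    ("alice", "knows", "bob"),
    ("alice", "worksAt", "acme"),
    ("carol", "knows", "alice"),
    ("bob", "worksAt", "acme"),
    ("carol", "knows", "bob"),
]
docs = build_documents(triples)

# Star centred on ?x:  ?x worksAt acme .  ?y knows ?x .
result = match_star(docs, out_edges=[("worksAt", "acme")], in_edges=[("knows", None)])

# Count every stored edge copy to check the replication factor.
stored = sum(
    len(neighbours)
    for doc in docs.values()
    for direction in ("out", "in")
    for neighbours in doc[direction].values()
)
print(result, stored)
```

In this sketch the query mixes a subject-object edge (`worksAt`) with an object-subject edge (`knows`) around the same central node, which is the shape the abstract calls a generalized star query. In a document database, each entry of `docs` would correspond to one stored document, and the check in `match_star` would be expressed as a filter over that document's fields.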

Notes

  1. In this paper we do not consider typed literals.

  2. Note that not all variables of Q necessarily appear in the output pattern O(Q) of Q.

Author information

Corresponding author

Correspondence to Manolis Gergatsoulis.

Copyright information

© 2019 Springer Nature Switzerland AG

About this paper

Cite this paper

Kalogeros, E., Gergatsoulis, M., Damigos, M. (2019). Document Based RDF Storage Method for Efficient Parallel Query Processing. In: Garoufallou, E., Sartori, F., Siatri, R., Zervas, M. (eds) Metadata and Semantic Research. MTSR 2018. Communications in Computer and Information Science, vol 846. Springer, Cham. https://doi.org/10.1007/978-3-030-14401-2_2

  • DOI: https://doi.org/10.1007/978-3-030-14401-2_2

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-14400-5

  • Online ISBN: 978-3-030-14401-2

  • eBook Packages: Computer Science, Computer Science (R0)
