A data distribution model for RDF

Abstract

The ever-increasing amount of available RDF data requires data to be partitioned across multiple servers. Research has made some progress towards scaling RDF query processing through suitable data distribution methods. In general, these methods work well for queries matching simple triple patterns, but they are not efficient for queries involving more complex patterns. In this paper, we present an RDF data distribution method that overcomes the shortcomings of current approaches in order to scale RDF storage with respect to both data volume and query processing. We apply a method that identifies frequent patterns accessed by queries in order to keep related data in the same partition. We apply our reasoning to a summarized view of the data in order to avoid exhaustive analysis of large datasets. As a result, partitioning templates are obtained from data items in an RDF structure. In addition, we provide an approach for dynamic data insertions, even if new data do not conform to the original RDF structure. Apart from the repartitioning approaches, we use an overflow repository to store data that may not follow the original schema. Our study shows that our method scales well and is effective in improving overall performance by decreasing the amount of message passing among servers, compared to alternative data distribution approaches for RDF.
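
As a rough illustration of the idea of keeping co-accessed data together, the sketch below groups predicates that frequently appear in the same query and routes triples with unknown predicates to an overflow list. It is not the algorithm of the paper: the workload, the `min_support` threshold, the function names, and the simplification of grouping by predicate only are all assumptions made for this example.

```python
from collections import Counter
from itertools import combinations

# Hypothetical workload: each query is reduced to the set of predicates it touches.
WORKLOAD = [
    {"name", "birthPlace"},
    {"name", "birthPlace"},
    {"name", "birthPlace", "author"},
    {"title", "author"},
    {"title", "author"},
]


def co_access_counts(workload):
    """Count how often each pair of predicates is used by the same query."""
    counts = Counter()
    for predicates in workload:
        for pair in combinations(sorted(predicates), 2):
            counts[pair] += 1
    return counts


def build_templates(workload, min_support):
    """Merge predicates whose co-access count reaches min_support (assumed threshold).

    A small union-find keeps the groups; each group stands in for a
    partitioning template in this toy setting.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    for predicates in workload:
        for p in predicates:
            find(p)  # register every predicate seen in the workload
    for (a, b), n in co_access_counts(workload).items():
        if n >= min_support:
            parent[find(a)] = find(b)
    groups = {}
    for p in list(parent):
        groups.setdefault(find(p), set()).add(p)
    return list(groups.values())


def place(triples, templates):
    """Send each (s, p, o) triple to its template's partition, else to overflow."""
    index = {p: i for i, group in enumerate(templates) for p in group}
    partitions = {i: [] for i in range(len(templates))}
    overflow = []  # triples whose predicate matches no template
    for triple in triples:
        target = index.get(triple[1])
        (overflow if target is None else partitions[target]).append(triple)
    return partitions, overflow


if __name__ == "__main__":
    templates = build_templates(WORKLOAD, min_support=2)
    data = [("s1", "name", "A"), ("s1", "birthPlace", "B"), ("s1", "height", "1.8")]
    print(place(data, templates))
```

In this toy setting, a query that reads `name` and `birthPlace` together finds both in one partition, while the `height` triple, which no template covers, is held in the overflow list instead of triggering a repartitioning.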

Notes

  1. http://wiki.dbpedia.org/Datasets.

  2. http://www.w3.org/wiki/LargeTripleStores.

Author information

Correspondence to Rebeca Schroeder.

Cite this article

Schroeder, R., Penteado, R.R.M. & Hara, C.S. A data distribution model for RDF. Distrib Parallel Databases 39, 129–167 (2021). https://doi.org/10.1007/s10619-020-07296-w

Keywords

  • RDF
  • Data fragmentation
  • Data allocation
  • Distributed databases
  • Dynamic datasets