Building self-clustering RDF databases using Tunable-LSH

  • Regular Paper
  • The VLDB Journal

Abstract

The Resource Description Framework (RDF) is a W3C standard for representing graph-structured data, and SPARQL is the standard query language for RDF. Recent advances in information extraction, linked data management and the Semantic Web have led to a rapid increase in both the volume and the variety of RDF data that are publicly available. As businesses start to capitalize on RDF data, RDF data management systems are being exposed to workloads that are far more diverse and dynamic than what they were designed to handle. Consequently, there is a growing need for developing workload-adaptive and self-tuning RDF data management systems. To realize this objective, we introduce a fast and efficient method for dynamically clustering records in an RDF data management system. Specifically, we assume nothing about the workload upfront, but as SPARQL queries are executed, we keep track of records that are co-accessed by the queries in the workload and physically cluster them. To decide dynamically and in constant time where a record needs to be placed in the storage system, we develop a new locality-sensitive hashing (LSH) scheme, Tunable-LSH. Using Tunable-LSH, records that are co-accessed across similar sets of queries can be hashed to the same or nearby physical pages in the storage system. What sets Tunable-LSH apart from existing LSH schemes is that it can auto-tune to achieve the aforementioned clustering objective with high accuracy even when the workloads change. Experimental evaluation of Tunable-LSH in an RDF data management system as well as in a standalone hash table shows end-to-end performance gains over existing solutions.
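
The clustering idea in the abstract can be illustrated with a minimal sketch. It is not Tunable-LSH itself: it uses classical bit sampling, a standard LSH family for Hamming distance, as a stand-in, and it has no auto-tuning. The function names, the toy query log, and the representation of a record as one bit per query are assumptions made for illustration only.

```python
import random
from collections import defaultdict

def utilization_vectors(query_log, num_records):
    """Build one bit vector per record: bit j is 1 if query j accessed the record."""
    vectors = [[0] * len(query_log) for _ in range(num_records)]
    for j, accessed in enumerate(query_log):
        for r in accessed:
            vectors[r][j] = 1
    return vectors

def make_bit_sampling_hash(num_queries, k, seed=0):
    """Classical bit-sampling LSH (a stand-in, not Tunable-LSH):
    project a vector onto k randomly sampled bit positions."""
    rng = random.Random(seed)
    positions = [rng.randrange(num_queries) for _ in range(k)]

    def h(vector):
        return tuple(vector[p] for p in positions)

    return h

def cluster(vectors, h):
    """Group record ids by hash value; each bucket models one physical page."""
    buckets = defaultdict(list)
    for r, v in enumerate(vectors):
        buckets[h(v)].append(r)
    return dict(buckets)

# Hypothetical workload: each entry is the set of record ids one query touched.
query_log = [{0, 1}, {0, 1}, {2, 3}, {2, 3}, {0, 1}, {2, 3}]
vectors = utilization_vectors(query_log, num_records=4)
h = make_bit_sampling_hash(num_queries=len(query_log), k=3)
pages = cluster(vectors, h)
```

In this toy workload, records 0 and 1 are always accessed together and never with records 2 and 3, so their vectors are identical and they land in the same bucket whichever positions are sampled. Computing the hash costs time proportional to `k`, independent of the number of records, which is the sense in which placement can be decided without scanning the data.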

Notes

  1. The Hamming distance between two record utilization vectors is equal to their edit distance [56], as well as to the Manhattan distance [54] between them, i.e., the \(l_{1}\) norm of their difference.

  2. This uniformity condition simplifies the sensitivity analysis of Tunable-LSH, but it is not a requirement from an algorithmic point of view. Relaxing this condition is left as future work.

  3. Groups are separated by vertical dashed lines.

  4. In practice, this translation is not required because the system maintains positional vectors instead.

  5. https://cs.uwaterloo.ca/~galuc/files/dbpedia-test-queries.tar.gz
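
The Hamming/Manhattan equivalence stated in Note 1 can be checked on a small example. The sketch below assumes that record utilization vectors are binary (one 0/1 entry per query). The two vectors and the helper names are made up for illustration.

```python
def hamming(u, v):
    """Number of positions at which two equal-length vectors differ."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    return sum(a != b for a, b in zip(u, v))

def manhattan(u, v):
    """l1 norm of the difference of two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    return sum(abs(a - b) for a, b in zip(u, v))

# Two hypothetical binary utilization vectors over six queries.
u = [1, 0, 1, 1, 0, 0]
v = [1, 1, 0, 1, 0, 1]

# For 0/1 entries, |a - b| is 1 exactly when a != b, so the two distances agree.
assert hamming(u, v) == manhattan(u, v)
```

The equality relies on the entries being 0 or 1. For vectors with larger integer entries the Manhattan distance can exceed the Hamming distance.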

References

  1. Abadi, D.J., Marcus, A., Madden, S.R., Hollenbach, K.: SW-Store: a vertically partitioned DBMS for semantic web data management. VLDB J. 18, 385–406 (2009)

  2. Aggarwal, C.C.: A survey of stream clustering algorithms. In: Aggarwal, C.C., Reddy, C.K. (eds.) Data Clustering: Algorithms and Applications, pp. 231–258. CRC Press, Boca Raton, Florida (2013)

  3. Agrawal, S., Chaudhuri, S., Narasayya, V.R.: Automated selection of materialized views and indexes in SQL databases. In: Proceedings of the 26th International Conference on Very Large DataBases, pp. 496–505 (2000)

  4. Ailamaki, A., DeWitt, D.J., Hill, M.D., Wood, D.A.: DBMSs on a modern processor: where does time go? In: Proceedings of the 25th International Conference on Very Large DataBases, pp. 266–277 (1999)

  5. Al-Harbi, R., Abdelaziz, I., Kalnis, P., Mamoulis, N., Ebrahim, Y., Sahli, M.: Accelerating SPARQL queries by exploiting hash-based locality and adaptive partitioning. VLDB J. 25(3), 355–380 (2016)

  6. Al-Harbi, R., Ebrahim, Y., Kalnis, P.: Phd-store: an adaptive SPARQL engine with dynamic partitioning for distributed RDF repositories. CoRR, arXiv:1405.4979 (2014)

  7. Aluç, G.: Workload Matters: A Robust Approach to Physical RDF Database Design. Ph.D. thesis, University of Waterloo (2015). https://uwspace.uwaterloo.ca/handle/10012/9774

  8. Aluç, G., DeHaan, D., Bowman, I.T.: Parametric plan caching using density-based clustering. In: Proceedings of the 28th International Conference on Data Engineering, pp. 402–413 (2012)

  9. Aluç, G., Hartig, O., Özsu, M.T., Daudjee, K.: Diversified stress testing of RDF data management systems. In: Proceedings of the 13th International Semantic Web Conference, pp. 197–212 (2014)

  10. Aluç, G., Özsu, M.T., Daudjee, K.: Workload matters: why RDF databases need a new design. Proc. VLDB Endow. 7(10), 837–840 (2014)

  11. Aluç, G., Özsu, M.T., Daudjee, K., Hartig, O.: Chameleon-db: a workload-aware robust RDF data management system. Technical Report CS-2013-10. University of Waterloo (2013)

  12. Aluç, G., Özsu, M.T., Daudjee, K., Hartig, O.: Executing queries over schemaless RDF databases. In: Proceedings of the 31st International Conference on Data Engineering, pp. 807–818 (2015)

  13. Andoni, A., Indyk, P.: Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions. In: Proceedings of the 47th Annual Symposium on Foundations of Computer Science, pp. 459–468 (2006)

  14. Arias, M., Fernández, J.D., Martínez-Prieto, M.A., de la Fuente, P.: An empirical study of real-world SPARQL queries. CoRR, arXiv:1103.5043 (2011)

  15. Athitsos, V., Potamias, M., Papapetrou, P., Kollios, G.: Nearest neighbor retrieval using distance-based hashing. In: Proceedings of the 24th International Conference on Data Engineering, pp. 327–336 (2008)

  16. Bast, H., Buchhold, B.: QLever: a query engine for efficient SPARQL+Text search. In: Proceedings of the 27th ACM International Conference on Information and Knowledge Management, pp. 647–656 (2017)

  17. Bello, R.G., Dias, K., Downing, A., Feenan, J.J., Finnerty Jr., J.L., Norcott, W.D., Sun, H., Witkowski, A., Ziauddin, M.: Materialized views in Oracle. In: Proceedings of the 24th International Conference on Very Large Data Bases, pp. 659–664 (1998)

  18. Berendt, B., Dragan, L., Hollink, L., Luczak-Rösch, M., Demidova, E., Dietze, S., Szymanski, J., Breslin, J.G. (eds.): Joint Proceedings of the 5th International Workshop on Using the Web in the Age of Data and the 2nd International Workshop on Dataset PROFIling and fEderated Search for Linked Data, Volume 1362 of CEUR Workshop Proceedings. CEUR-WS.org (2015)

  19. Bingmann, T.: STX B+ tree C++ template classes. https://panthema.net/2007/stx-btree/ (2007). Accessed 16 Aug 2018

  20. Bislimovska, B., Aluç, G., Özsu, M.T., Fraternali, P.: Graph search of software models using multidimensional scaling. In: Proceedings of the Workshops of the EDBT/ICDT 2015 Joint Conference, pp. 163–170 (2015)

  21. Bornea, M.A., Dolby, J., Kementsietsidis, A., Srinivas, K., Dantressangle, P., Udrea, O., Bhattacharjee, B.: Building an efficient RDF store over a relational database. In: Proceedings of the ACM SIGMOD International Conference on Management of Data, pp. 121–132 (2013)

  22. Broder, A.Z.: On the resemblance and containment of documents. In: Proceedings of the Compression and Complexity of Sequences, pp. 21–29 (1997)

  23. Broder, A.Z., Charikar, M., Frieze, A.M., Mitzenmacher, M.: Min-wise independent permutations. J. Comput. Syst. Sci. 60(3), 630–659 (2000)

  24. Bruno, N., Chaudhuri, S.: To tune or not to tune? a lightweight physical design alerter. In: Proceedings of the 32nd International Conference on Very Large Data Bases, pp. 499–510 (2006)

  25. Carroll, J.J., Dickinson, I., Dollin, C., Reynolds, D., Seaborne, A., Wilkinson, K.: Jena: implementing the semantic web recommendations. In: Proceedings of the 13th International World Wide Web Conference—Alternate Track Papers and Posters, pp. 74–83 (2004)

  26. Ceri, S., Navathe, S.B., Wiederhold, G.: Distribution design of logical database schemas. IEEE Trans. Softw. Eng. 9(4), 487–504 (1983)

  27. Charikar, M.: Similarity estimation techniques from rounding algorithms. In: Proceedings of the 34th Annual ACM Symposium on the Theory of Computing, pp. 380–388 (2002)

  28. Chaudhuri, S., Narasayya, V.: Self-tuning database systems: a decade of progress. In: Proceedings of the 33rd International Conference on Very Large Data Bases, pp. 3–14 (2007)

  29. Datar, M., Immorlica, N., Indyk, P., Mirrokni, V.S.: Locality-sensitive hashing scheme based on p-stable distributions. In: Proceedings of the 20th Annual Symposium on Computational Geometry, pp. 253–262 (2004)

  30. Erling, O.: Virtuoso, a hybrid RDBMS/graph column store. IEEE Data Eng. Bull. 35(1), 3–8 (2012)

  31. Ferhatosmanoglu, H., Tuncel, E., Agrawal, D., El Abbadi, A.: Approximate nearest neighbor searching in multimedia databases. In: Proceedings of the 17th International Conference on Data Engineering, pp. 503–511 (2001)

  32. Foley, J.D., van Dam, A., Feiner, S.K., Hughes, J.F.: Computer Graphics: Principles and Practice, 2nd edn. Addison-Wesley Longman Publishing Co. Inc, Boston (1990)

  33. French, K.R., Schwert, G.W., Stambaugh, R.F.: Expected stock returns and volatility. J. Finan. Econ. 19, 3–30 (1987)

  34. Galarraga, L., Hose, K., Schenkel, R.: Partout: a distributed engine for efficient RDF processing. In: Proceedings of the 23rd International World Wide Web Conference, Companion Volume, pp. 267–268 (2014)

  35. Gionis, A., Indyk, P., Motwani, R.: Similarity search in high dimensions via hashing. In: Proceedings of 25th International Conference on Very Large Data Bases, pp. 518–529 (1999)

  36. Goasdoué, F., Karanasos, K., Leblay, J., Manolescu, I.: View selection in semantic web databases. Proc. VLDB Endow. 5(2), 97–108 (2011)

  37. Graefe, G., Idreos, S., Kuno, H.A., Manegold, S.: Benchmarking adaptive indexing. In: Proceedings of the Performance Evaluation, Measurement and Characterization of Complex Systems—2nd TPC Technology Conference TPCTC, pp. 169–184 (2010)

  38. Gurajada, S., Seufert, S., Miliaraki, I., Theobald, M.: Triad: a distributed shared-nothing RDF engine based on asynchronous message passing. In: Proceedings of the ACM SIGMOD International Conference on Management of Data, pp. 289–300 (2014)

  39. Halim, F., Idreos, S., Karras, P., Yap, R.H.C.: Stochastic database cracking: towards robust adaptive indexing in main-memory column-stores. Proc. VLDB Endow. 5(6), 502–513 (2012)

  40. Hamming, R.W. (ed.): Coding and Information Theory. Prentice-Hall, Englewood Cliffs (1986)

  41. Harbi, R., Abdelaziz, I., Kalnis, P., Mamoulis, N.: Evaluating SPARQL queries on massive RDF datasets. Proc. VLDB Endow. 8(12), 1848–1851 (2015)

  42. Harris, S., Seaborne, A., Prud’hommeaux, E.: SPARQL 1.1 query language. W3C Recommendation (2013)

  43. Harth, A., Umbrich, J., Hogan, A., Decker, S.: YARS2: a federated repository for querying graph structured data from the web. In: Proceedings of the 6th International Semantic Web Conference, pp. 211–224 (2007)

  44. He, L., Shao, B., Li, Y., Xia, H., Xiao, Y., Chen, E., Chen, L.: Stylus: a strongly-typed store for serving massive RDF data. Proc. VLDB Endow. 11(2), 203–216 (2017)

  45. Hose, K., Schenkel, R.: WARP: workload-aware replication and partitioning for RDF. In: Proceedings of Workshops of the 29th IEEE International Conference on Data Engineering, pp. 1–6 (2013)

  46. Houle, M.E., Sakuma, J.: Fast approximate similarity search in extremely high-dimensional data sets. In: Proceedings of the 21st International Conference on Data Engineering, pp. 619–630 (2005)

  47. Husain, M.F., McGlothlin, J.P., Masud, M.M., Khan, L.R., Thuraisingham, B.M.: Heuristics-based query processing for large RDF graphs using cloud computing. IEEE Trans. Knowl. Data Eng. 23(9), 1312–1327 (2011)

  48. Idreos, S., Kersten, M.L., Manegold, S.: Database cracking. In: Proceedings of the 3rd Biennial Conference on Innovative Data Systems Research, pp. 68–78 (2007)

  49. Idreos, S., Manegold, S., Kuno, H.A., Graefe, G.: Merging what’s cracked, cracking what’s merged: adaptive indexing in main-memory column-stores. Proc. VLDB Endow. 4(9), 585–597 (2011)

  50. Indyk, P., Motwani, R.: Approximate nearest neighbors: towards removing the curse of dimensionality. In: Proceedings of the 30th Annual ACM Symposium on Theory of Computing, pp. 604–613 (1998)

  51. Jaccard, P.: The distribution of flora in the alpine zone. New Phytol. 11(2), 37–50 (1912)

  52. Jain, A.K., Murty, M.N., Flynn, P.J.: Data clustering: a review. ACM Comput. Surv. 31, 264–323 (1999)

  53. Kirchberg, M., Ko, R.K.L., Lee, B.-S.: From linked data to relevant data—time is the essence. CoRR, arXiv:1103.5046 (2011)

  54. Krause, E.F. (ed.): Taxicab Geometry: An Adventure in Non-Euclidean Geometry. Dover, New York (1986)

  55. Kruskal, J.B.: Multidimensional scaling by optimizing goodness of fit to a nonmetric hypothesis. Psychometrika 29, 1–27 (1964)

  56. Levenshtein, V.I.: Binary codes capable of correcting deletions, insertions, and reversals. Soviet Phys. Dokl. 10(8), 707–710 (1966)

  57. Lightstone, S., Teorey, T.J., Nadeau, T.P.: Physical Database Design: the Database Professional’s Guide to Exploiting Indexes, Views, Storage, and More. Morgan Kaufmann, Burlington (2007)

  58. McGlothlin, J.P., Khan, L.R.: Materializing and persisting inferred and uncertain knowledge in RDF datasets. In: Proceedings of the 24th International Conference on Artificial Intelligence (2010)

  59. Morrison, A., Ross, G., Chalmers, M.: Fast multidimensional scaling through sampling, springs and interpolation. Inf. Vis. 2(1), 68–77 (2003)

  60. Morsey, M., Lehmann, J., Auer, S., Ngomo, A.-C.N.: DBpedia SPARQL benchmark—performance assessment with real queries on real data. In: Proceedings of the 10th International Semantic Web Conference, pp. 454–469 (2011)

  61. Morton, G.M.: A computer oriented geodetic data base; and a new technique in file sequencing. Technical report. IBM Ltd., Ottawa, Canada (1966)

  62. Nah, F.F.-H.: A study on tolerable waiting time: how long are Web users willing to wait? Behav. IT 23(3), 153–163 (2004)

  63. Neumann, T., Weikum, G.: The RDF-3X engine for scalable management of RDF data. VLDB J. 19(1), 91–113 (2010)

  64. Neumann, T., Weikum, G.: x-RDF-3X: fast querying, high update rates, and consistency for RDF databases. Proc. VLDB Endow. 3(1), 256–263 (2010)

  65. Papailiou, N., Konstantinou, I., Tsoumakos, D., Koziris, N.: H2RDF: adaptive query processing on RDF data in the cloud. In: Proceedings of the 21st International World Wide Web Conference Companion Volume, pp. 397–400 (2012)

  66. Papailiou, N., Tsoumakos, D., Karras, P., Koziris, N.: Graph-aware, workload-adaptive SPARQL query caching. In: Proceedings of the ACM SIGMOD International Conference on Management of Data, pp. 1777–1792 (2015)

  67. Reed, W.: The normal-Laplace distribution and its relatives. In: Proceedings of the Advances in Distribution Theory, Order Statistics, and Inference, pp. 61–74 (2006)

  68. Sakaki, T., Okazaki, M., Matsuo, Y.: Earthquake shakes twitter users: real-time event detection by social sensors. In: Proceedings of the 19th International World Wide Web Conference, pp. 851–860 (2010)

  69. Sidirourgos, L., Goncalves, R., Kersten, M., Nes, N., Manegold, S.: Column-store support for RDF data management: not all swans are white. Proc. VLDB Endow. 1(2), 1553–1563 (2008)

  70. std::hash. http://www.cplusplus.com/reference/functional/hash/ (2015). Accessed 16 Aug 2018

  71. std::map. http://www.cplusplus.com/reference/map/map/ (2015). Accessed 16 Aug 2018

  72. std::unordered_map. http://www.cplusplus.com/reference/unordered_map/unordered_map/ (2015). Accessed 16 Aug 2018

  73. Tao, Y., Yi, K., Sheng, K., Kalnis, P.: Efficient and accurate nearest neighbor and closest pair search in high-dimensional space. ACM Trans. Database Syst. 35(3), 20 (2010)

  74. Weiss, C., Karras, P., Bernstein, A.: Hexastore: sextuple indexing for semantic web data management. Proc. VLDB Endow. 1(1), 1008–1019 (2008)

  75. Wilkinson, K.: Jena property table implementation. Technical Report HPL-2006-140, HP-Labs (2006)

  76. Yuan, P., Liu, P., Wu, B., Jin, H., Zhang, W., Liu, L.: TripleBit: a fast and compact system for large scale RDF data. Proc. VLDB Endow. 6(7), 517–528 (2013)

  77. Zeng, L., Zou, L.: Redesign of the gStore system. Front. Comput. Sci. 12(4), 623–641 (2018)

  78. Zilio, D.C., Rao, J., Lightstone, S., Lohman, G.M., Storm, A.J., Garcia-Arellano, C., Fadden, S.: DB2 design advisor: integrated automatic physical database design. In: Proceedings of the 30th International Conference on Very Large Data Bases, pp. 1087–1097 (2004)

  79. Zou, L., Mo, J., Zhao, D., Chen, L., Özsu, M.T.: gStore: answering SPARQL queries via subgraph matching. Proc. VLDB Endow. 4(1), 482–493 (2011)

Author information

Corresponding author

Correspondence to Güneş Aluç.

About this article

Cite this article

Aluç, G., Özsu, M.T. & Daudjee, K. Building self-clustering RDF databases using Tunable-LSH. The VLDB Journal 28, 173–195 (2019). https://doi.org/10.1007/s00778-018-0530-9

  • DOI: https://doi.org/10.1007/s00778-018-0530-9
