The VLDB Journal, Volume 28, Issue 2, pp 173–195

Building self-clustering RDF databases using Tunable-LSH

  • Güneş Aluç
  • M. Tamer Özsu
  • Khuzaima Daudjee
Regular Paper

Abstract

The Resource Description Framework (RDF) is a W3C standard for representing graph-structured data, and SPARQL is the standard query language for RDF. Recent advances in information extraction, linked data management and the Semantic Web have led to a rapid increase in both the volume and the variety of publicly available RDF data. As businesses start to capitalize on RDF data, RDF data management systems are being exposed to workloads that are far more diverse and dynamic than those they were designed to handle. Consequently, there is a growing need for workload-adaptive and self-tuning RDF data management systems. To realize this objective, we introduce a fast and efficient method for dynamically clustering records in an RDF data management system. Specifically, we assume nothing about the workload upfront, but as SPARQL queries are executed, we keep track of records that are co-accessed by the queries in the workload and physically cluster them. To decide dynamically, and in constant time, where a record should be placed in the storage system, we develop a new locality-sensitive hashing (LSH) scheme, Tunable-LSH. Using Tunable-LSH, records that are co-accessed across similar sets of queries can be hashed to the same or nearby physical pages in the storage system. What sets Tunable-LSH apart from existing LSH schemes is that it can auto-tune to achieve this clustering objective with high accuracy even when the workload changes. Experimental evaluation of Tunable-LSH, both in an RDF data management system and in a standalone hash table, shows end-to-end performance gains over existing solutions.
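The general idea of hashing co-accessed records to the same page can be illustrated with a small sketch. The code below is not Tunable-LSH; it uses classic bit-sampling LSH for Hamming distance as a stand-in, and it assumes each record is described by a bit vector whose i-th bit is 1 when the i-th query accessed the record. The names `make_bit_sampling_hash`, `page_of`, and the sample access vectors are hypothetical and chosen only for illustration.

```python
import random


def make_bit_sampling_hash(num_queries, num_bits, seed=0):
    """Build a bit-sampling LSH function (a stand-in, not Tunable-LSH).

    The function samples `num_bits` fixed positions of a record's
    query-access vector and packs the sampled bits into a page id.
    """
    rng = random.Random(seed)
    positions = rng.sample(range(num_queries), num_bits)

    def page_of(access_vector):
        page = 0
        for pos in positions:
            # Append the sampled bit to the page id.
            page = (page << 1) | access_vector[pos]
        return page

    return page_of


# Hypothetical access vectors: bit i is 1 if query i touched the record.
access = {
    "r1": [1, 1, 0, 0, 1, 0, 0, 0],
    "r2": [1, 1, 0, 0, 1, 0, 0, 0],  # co-accessed with r1 by every query
    "r3": [0, 0, 1, 1, 0, 1, 1, 1],  # never co-accessed with r1
}

page_of = make_bit_sampling_hash(num_queries=8, num_bits=3, seed=42)
pages = {rid: page_of(vec) for rid, vec in access.items()}
print(pages)
```

Records with identical access vectors always receive the same page id, and records whose vectors differ in only a few positions collide with high probability, because a collision requires only that the sampled positions agree. Computing the page id takes time proportional to `num_bits`, independent of the number of records.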

Keywords

RDF, SPARQL, Graph data management, Storage and indexing, Workload-adaptive tuning, Locality-sensitive hashing, Clustering, Physical database design

Copyright information

© Springer-Verlag GmbH Germany, part of Springer Nature 2018

Authors and Affiliations

  • Güneş Aluç (1)
  • M. Tamer Özsu (2)
  • Khuzaima Daudjee (2)
  1. SAP Labs, Waterloo, Canada
  2. University of Waterloo, Waterloo, Canada
