Building self-clustering RDF databases using Tunable-LSH

Aluç, Güneş; Özsu, M. Tamer; Daudjee, Khuzaima

doi:10.1007/s00778-018-0530-9

Regular Paper
Published: 03 December 2018

Volume 28, pages 173–195, (2019)

The VLDB Journal

Güneş Aluç¹,
M. Tamer Özsu² &
Khuzaima Daudjee²

Abstract

The Resource Description Framework (RDF) is a W3C standard for representing graph-structured data, and SPARQL is the standard query language for RDF. Recent advances in information extraction, linked data management and the Semantic Web have led to a rapid increase in both the volume and the variety of publicly available RDF data. As businesses start to capitalize on RDF data, RDF data management systems are being exposed to workloads that are far more diverse and dynamic than those they were designed to handle. Consequently, there is a growing need for workload-adaptive and self-tuning RDF data management systems. To realize this objective, we introduce a fast and efficient method for dynamically clustering records in an RDF data management system. Specifically, we assume nothing about the workload upfront, but as SPARQL queries are executed, we keep track of records that are co-accessed by the queries in the workload and physically cluster them. To decide dynamically and in constant time where a record needs to be placed in the storage system, we develop a new locality-sensitive hashing (LSH) scheme, Tunable-LSH. Using Tunable-LSH, records that are co-accessed across similar sets of queries can be hashed to the same or nearby physical pages in the storage system. What sets Tunable-LSH apart from existing LSH schemes is that it can auto-tune to achieve this clustering objective with high accuracy even when the workload changes. Experimental evaluation of Tunable-LSH in an RDF data management system as well as in a standalone hash table shows end-to-end performance gains over existing solutions.
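To make the idea of hashing co-accessed records to the same page concrete, the following is a minimal Python sketch. It is not Tunable-LSH: it uses classic bit-sampling LSH for Hamming distance as a stand-in, and the record names, vectors, and the function `make_bit_sampling_hash` are hypothetical, chosen only for illustration. Each record is described by a bit vector in which bit i is 1 if query i accessed the record.

```python
import random


def make_bit_sampling_hash(num_queries, num_bits, seed=0):
    """Build a bit-sampling LSH function (an assumption, not Tunable-LSH).

    The function samples `num_bits` fixed positions of a utilization vector
    and packs the sampled bits into an integer bucket (page) number.
    """
    rng = random.Random(seed)
    positions = rng.sample(range(num_queries), num_bits)

    def lsh(vector):
        bucket = 0
        for p in positions:
            bucket = (bucket << 1) | vector[p]
        return bucket

    return lsh


def hamming(u, v):
    """Number of positions at which two equal-length bit vectors differ."""
    return sum(a != b for a, b in zip(u, v))


# Hypothetical utilization vectors over a workload of 8 queries.
records = {
    "r1": [1, 1, 0, 0, 1, 0, 1, 0],
    "r2": [1, 1, 0, 0, 1, 0, 1, 0],  # accessed by exactly the same queries as r1
    "r3": [0, 0, 1, 1, 0, 1, 0, 1],  # accessed by the complementary set of queries
}

lsh = make_bit_sampling_hash(num_queries=8, num_bits=3, seed=42)
pages = {name: lsh(vec) for name, vec in records.items()}
```

In this sketch, r1 and r2 always land on the same page because their vectors are identical, while r3 always lands on a different page because every sampled bit differs. Records whose vectors differ in only a few positions collide with a probability that grows as their Hamming distance shrinks, which is the property an LSH-based placement relies on. Evaluating the hash takes time proportional to `num_bits`, independent of the number of stored records.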

Notes

The Hamming distance between two record utilization vectors is equal to their edit distance [56], and also to their Manhattan distance [54], that is, the distance between the two vectors in the \(l_{1}\) norm.
This uniformity condition simplifies the sensitivity analysis of Tunable-LSH, but it is not a requirement from an algorithmic point of view. Relaxing this condition is left as future work.
Groups are separated by vertical dashed lines.
In practice, this translation is not required because the system maintains positional vectors instead.
https://cs.uwaterloo.ca/~galuc/files/dbpedia-test-queries.tar.gz
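
The first note above, on Hamming and Manhattan distance, can be checked with a short Python sketch. It assumes that utilization vectors are 0/1 vectors of equal length; the vectors `u` and `v` are made up for illustration. For such vectors, each differing position contributes exactly 1 to both distances, so the two values coincide.

```python
def hamming(u, v):
    """Count the positions at which two equal-length vectors differ."""
    return sum(a != b for a, b in zip(u, v))


def manhattan(u, v):
    """Sum of absolute coordinate differences (distance in the l1 norm)."""
    return sum(abs(a - b) for a, b in zip(u, v))


# Hypothetical 0/1 utilization vectors over five queries.
u = [1, 0, 1, 1, 0]
v = [0, 0, 1, 0, 1]

same = hamming(u, v) == manhattan(u, v)
```

The equality holds only because the entries are restricted to 0 and 1; for vectors with arbitrary integer entries the Manhattan distance can exceed the Hamming distance.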

References

1. Abadi, D.J., Marcus, A., Madden, S.R., Hollenbach, K.: SW-Store: a vertically partitioned DBMS for semantic web data management. VLDB J. 18, 385–406 (2009)
2. Aggarwal, C.C.: A survey of stream clustering algorithms. In: Aggarwal, C.C., Reddy, C.K. (eds.) Data Clustering: Algorithms and Applications, pp. 231–258. CRC Press, Boca Raton, Florida (2013)
3. Agrawal, S., Chaudhuri, S., Narasayya, V.R.: Automated selection of materialized views and indexes in SQL databases. In: Proceedings of the 26th International Conference on Very Large Data Bases, pp. 496–505 (2000)
4. Ailamaki, A., DeWitt, D.J., Hill, M.D., Wood, D.A.: DBMSs on a modern processor: where does time go? In: Proceedings of the 25th International Conference on Very Large Data Bases, pp. 266–277 (1999)
5. Al-Harbi, R., Abdelaziz, I., Kalnis, P., Mamoulis, N., Ebrahim, Y., Sahli, M.: Accelerating SPARQL queries by exploiting hash-based locality and adaptive partitioning. VLDB J. 25(3), 355–380 (2016)
6. Al-Harbi, R., Ebrahim, Y., Kalnis, P.: Phd-store: an adaptive SPARQL engine with dynamic partitioning for distributed RDF repositories. CoRR, arXiv:1405.4979 (2014)
7. Aluç, G.: Workload Matters: A Robust Approach to Physical RDF Database Design. Ph.D. thesis, University of Waterloo (2015). https://uwspace.uwaterloo.ca/handle/10012/9774
8. Aluç, G., DeHaan, D., Bowman, I.T.: Parametric plan caching using density-based clustering. In: Proceedings of the 28th International Conference on Data Engineering, pp. 402–413 (2012)
9. Aluç, G., Hartig, O., Özsu, M.T., Daudjee, K.: Diversified stress testing of RDF data management systems. In: Proceedings of the 13th International Semantic Web Conference, pp. 197–212 (2014)
10. Aluç, G., Özsu, M.T., Daudjee, K.: Workload matters: why RDF databases need a new design. Proc. VLDB Endow. 7(10), 837–840 (2014)
11. Aluç, G., Özsu, M.T., Daudjee, K., Hartig, O.: Chameleon-db: a workload-aware robust RDF data management system. Technical Report CS-2013-10. University of Waterloo (2013)
12. Aluç, G., Özsu, M.T., Daudjee, K., Hartig, O.: Executing queries over schemaless RDF databases. In: Proceedings of the 31st International Conference on Data Engineering, pp. 807–818 (2015)
13. Andoni, A., Indyk, P.: Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions. In: Proceedings of the 47th Annual Symposium on Foundations of Computer Science, pp. 459–468 (2006)
14. Arias, M., Fernández, J.D., Martínez-Prieto, M.A., de la Fuente, P.: An empirical study of real-world SPARQL queries. CoRR, arXiv:1103.5043 (2011)
15. Athitsos, V., Potamias, M., Papapetrou, P., Kollios, G.: Nearest neighbor retrieval using distance-based hashing. In: Proceedings of the 24th International Conference on Data Engineering, pp. 327–336 (2008)
16. Bast, H., Buchhold, B.: Qlever: a query engine for efficient SPARQL+text search. In: Proceedings of the 27th ACM International Conference on Information and Knowledge Management, pp. 647–656 (2017)
17. Bello, R.G., Dias, K., Downing, A., Feenan, J.J., Finnerty Jr., J.L., Norcott, W.D., Sun, H., Witkowski, A., Ziauddin, M.: Materialized views in Oracle. In: Proceedings of the 24th International Conference on Very Large Data Bases, pp. 659–664 (1998)
18. Berendt, B., Dragan, L., Hollink, L., Luczak-Rösch, M., Demidova, E., Dietze, S., Szymanski, J., Breslin, J.G. (eds.): Joint Proceedings of the 5th International Workshop on Using the Web in the Age of Data and the 2nd International Workshop on Dataset PROFIling and fEderated Search for Linked Data. CEUR Workshop Proceedings, vol. 1362. CEUR-WS.org (2015)
19. Bingmann, T.: STX B+ tree C++ template classes. https://panthema.net/2007/stx-btree/ (2007). Accessed 16 Aug 2018
20. Bislimovska, B., Aluç, G., Özsu, M.T., Fraternali, P.: Graph search of software models using multidimensional scaling. In: Proceedings of the Workshops of the EDBT/ICDT 2015 Joint Conference, pp. 163–170 (2015)
21. Bornea, M.A., Dolby, J., Kementsietsidis, A., Srinivas, K., Dantressangle, P., Udrea, O., Bhattacharjee, B.: Building an efficient RDF store over a relational database. In: Proceedings of the ACM SIGMOD International Conference on Management of Data, pp. 121–132 (2013)
22. Broder, A.Z.: On the resemblance and containment of documents. In: Proceedings of the Compression and Complexity of Sequences, pp. 21–29 (1997)
23. Broder, A.Z., Charikar, M., Frieze, A.M., Mitzenmacher, M.: Min-wise independent permutations. J. Comput. Syst. Sci. 60(3), 630–659 (2000)
24. Bruno, N., Chaudhuri, S.: To tune or not to tune? A lightweight physical design alerter. In: Proceedings of the 32nd International Conference on Very Large Data Bases, pp. 499–510 (2006)
25. Carroll, J.J., Dickinson, I., Dollin, C., Reynolds, D., Seaborne, A., Wilkinson, K.: Jena: implementing the semantic web recommendations. In: Proceedings of the 13th International World Wide Web Conference - Alternate Track Papers and Posters, pp. 74–83 (2004)
26. Ceri, S., Navathe, S.B., Wiederhold, G.: Distribution design of logical database schemas. IEEE Trans. Softw. Eng. 9(4), 487–504 (1983)
27. Charikar, M.: Similarity estimation techniques from rounding algorithms. In: Proceedings of the 34th Annual ACM Symposium on the Theory of Computing, pp. 380–388 (2002)
28. Chaudhuri, S., Narasayya, V.: Self-tuning database systems: a decade of progress. In: Proceedings of the 33rd International Conference on Very Large Data Bases, pp. 3–14 (2007)
29. Datar, M., Immorlica, N., Indyk, P., Mirrokni, V.S.: Locality-sensitive hashing scheme based on p-stable distributions. In: Proceedings of the 20th Annual Symposium on Computational Geometry, pp. 253–262 (2004)
30. Erling, O.: Virtuoso, a hybrid RDBMS/graph column store. IEEE Data Eng. Bull. 35(1), 3–8 (2012)
31. Ferhatosmanoglu, H., Tuncel, E., Agrawal, D., El Abbadi, A.: Approximate nearest neighbor searching in multimedia databases. In: Proceedings of the 17th International Conference on Data Engineering, pp. 503–511 (2001)
32. Foley, J.D., van Dam, A., Feiner, S.K., Hughes, J.F.: Computer Graphics: Principles and Practice, 2nd edn. Addison-Wesley Longman Publishing Co. Inc, Boston (1990)
33. French, K.R., Schwert, G.W., Stambaugh, R.F.: Expected stock returns and volatility. J. Finan. Econ. 19, 3–30 (1987)
34. Galarraga, L., Hose, K., Schenkel, R.: Partout: a distributed engine for efficient RDF processing. In: Proceedings of the 23rd International World Wide Web Conference, Companion Volume, pp. 267–268 (2014)
35. Gionis, A., Indyk, P., Motwani, R.: Similarity search in high dimensions via hashing. In: Proceedings of the 25th International Conference on Very Large Data Bases, pp. 518–529 (1999)
36. Goasdoué, F., Karanasos, K., Leblay, J., Manolescu, I.: View selection in semantic web databases. Proc. VLDB Endow. 5(2), 97–108 (2011)
37. Graefe, G., Idreos, S., Kuno, H.A., Manegold, S.: Benchmarking adaptive indexing. In: Proceedings of the Performance Evaluation, Measurement and Characterization of Complex Systems - 2nd TPC Technology Conference TPCTC, pp. 169–184 (2010)
38. Gurajada, S., Seufert, S., Miliaraki, I., Theobald, M.: Triad: a distributed shared-nothing RDF engine based on asynchronous message passing. In: Proceedings of the ACM SIGMOD International Conference on Management of Data, pp. 289–300 (2014)
39. Halim, F., Idreos, S., Karras, P., Yap, R.H.C.: Stochastic database cracking: towards robust adaptive indexing in main-memory column-stores. Proc. VLDB Endow. 5(6), 502–513 (2012)
40. Hamming, R.W. (ed.): Coding and Information Theory. Prentice-Hall, Englewood Cliffs (1986)
41. Harbi, R., Abdelaziz, I., Kalnis, P., Mamoulis, N.: Evaluating SPARQL queries on massive RDF datasets. Proc. VLDB Endow. 8(12), 1848–1851 (2015)
42. Harris, S., Seaborne, A., Prud’hommeaux, E.: SPARQL 1.1 query language. W3C Recommendation (2013)
43. Harth, A., Umbrich, J., Hogan, A., Decker, S.: Yars2: a federated repository for querying graph structured data from the web. In: Proceedings of the 6th International Semantic Web Conference, pp. 211–224 (2007)
44. He, L., Shao, B., Li, Y., Xia, H., Xiao, Y., Chen, E., Chen, L.: Stylus: a strongly-typed store for serving massive RDF data. Proc. VLDB Endow. 11(2), 203–216 (2017)
45. Hose, K., Schenkel, R.: WARP: workload-aware replication and partitioning for RDF. In: Proceedings of Workshops of the 29th IEEE International Conference on Data Engineering, pp. 1–6 (2013)
46. Houle, M.E., Sakuma, J.: Fast approximate similarity search in extremely high-dimensional data sets. In: Proceedings of the 21st International Conference on Data Engineering, pp. 619–630 (2005)
47. Husain, M.F., McGlothlin, J.P., Masud, M.M., Khan, L.R., Thuraisingham, B.M.: Heuristics-based query processing for large RDF graphs using cloud computing. IEEE Trans. Knowl. Data Eng. 23(9), 1312–1327 (2011)
48. Idreos, S., Kersten, M.L., Manegold, S.: Database cracking. In: Proceedings of the 3rd Biennial Conference on Innovative Data Systems Research, pp. 68–78 (2007)
49. Idreos, S., Manegold, S., Kuno, H.A., Graefe, G.: Merging what’s cracked, cracking what’s merged: adaptive indexing in main-memory column-stores. Proc. VLDB Endow. 4(9), 585–597 (2011)
50. Indyk, P., Motwani, R.: Approximate nearest neighbors: towards removing the curse of dimensionality. In: Proceedings of the 30th Annual ACM Symposium on Theory of Computing, pp. 604–613 (1998)
51. Jaccard, P.: The distribution of flora in the alpine zone. New Phytol. 11(2), 37–50 (1912)
52. Jain, A.K., Murty, M.N., Flynn, P.J.: Data clustering: a review. ACM Comput. Surv. 31, 264–323 (1999)
53. Kirchberg, M., Ko, R.K.L., Lee, B.-S.: From linked data to relevant data - time is the essence. CoRR, arXiv:1103.5046 (2011)
54. Krause, E.F. (ed.): Taxicab Geometry: An Adventure in Non-Euclidean Geometry. Dover, New York (1986)
55. Kruskal, J.B.: Multidimensional scaling by optimizing goodness of fit to a nonmetric hypothesis. Psychometrika 29, 1–27 (1964)
56. Levenshtein, V.I.: Binary codes capable of correcting deletions, insertions, and reversals. Soviet Phys. Dokl. 10(8), 707–710 (1966)
57. Lightstone, S., Teorey, T.J., Nadeau, T.P.: Physical Database Design: the Database Professional’s Guide to Exploiting Indexes, Views, Storage, and More. Morgan Kaufmann, Burlington (2007)
58. McGlothlin, J.P., Khan, L.R.: Materializing and persisting inferred and uncertain knowledge in RDF datasets. In: Proceedings of the 24th International Conference on Artificial Intelligence (2010)
59. Morrison, A., Ross, G., Chalmers, M.: Fast multidimensional scaling through sampling, springs and interpolation. Inf. Vis. 2(1), 68–77 (2003)
60. Morsey, M., Lehmann, J., Auer, S., Ngomo, A.-C.N.: DBpedia SPARQL benchmark - performance assessment with real queries on real data. In: Proceedings of the 10th International Semantic Web Conference, pp. 454–469 (2011)
61. Morton, G.M.: A computer oriented geodetic data base; and a new technique in file sequencing. Technical report. IBM Ltd., Ottawa, Canada (1966)
62. Nah, F.F.-H.: A study on tolerable waiting time: how long are Web users willing to wait? Behav. IT 23(3), 153–163 (2004)
63. Neumann, T., Weikum, G.: The RDF-3X engine for scalable management of RDF data. VLDB J. 19(1), 91–113 (2010)
64. Neumann, T., Weikum, G.: x-RDF-3X: fast querying, high update rates, and consistency for RDF databases. Proc. VLDB Endow. 3(1), 256–263 (2010)
65. Papailiou, N., Konstantinou, I., Tsoumakos, D., Koziris, N.: H2RDF: adaptive query processing on RDF data in the cloud. In: Proceedings of the 21st International World Wide Web Conference, Companion Volume, pp. 397–400 (2012)
66. Papailiou, N., Tsoumakos, D., Karras, P., Koziris, N.: Graph-aware, workload-adaptive SPARQL query caching. In: Proceedings of the ACM SIGMOD International Conference on Management of Data, pp. 1777–1792 (2015)
67. Reed, W.: The normal-Laplace distribution and its relatives. In: Proceedings of the Advances in Distribution Theory, Order Statistics, and Inference, pp. 61–74 (2006)
68. Sakaki, T., Okazaki, M., Matsuo, Y.: Earthquake shakes Twitter users: real-time event detection by social sensors. In: Proceedings of the 19th International World Wide Web Conference, pp. 851–860 (2010)
69. Sidirourgos, L., Goncalves, R., Kersten, M., Nes, N., Manegold, S.: Column-store support for RDF data management: not all swans are white. Proc. VLDB Endow. 1(2), 1553–1563 (2008)
70. std::hash. http://www.cplusplus.com/reference/functional/hash/ (2015). Accessed 16 Aug 2018
71. std::map. http://www.cplusplus.com/reference/map/map/ (2015). Accessed 16 Aug 2018
72. std::unordered_map. http://www.cplusplus.com/reference/unordered_map/unordered_map/ (2015). Accessed 16 Aug 2018
73. Tao, Y., Yi, K., Sheng, K., Kalnis, P.: Efficient and accurate nearest neighbor and closest pair search in high-dimensional space. ACM Trans. Database Syst. 35(3), 20 (2010)
74. Weiss, C., Karras, P., Bernstein, A.: Hexastore: sextuple indexing for semantic web data management. Proc. VLDB Endow. 1(1), 1008–1019 (2008)
75. Wilkinson, K.: Jena property table implementation. Technical Report HPL-2006-140, HP-Labs (2006)
76. Yuan, P., Liu, P., Wu, B., Jin, H., Zhang, W., Liu, L.: TripleBit: a fast and compact system for large scale RDF data. Proc. VLDB Endow. 6(7), 517–528 (2013)
77. Zeng, L., Zou, L.: Redesign of the gStore system. Front. Comput. Sci. 12(4), 623–641 (2018)
78. Zilio, D.C., Rao, J., Lightstone, S., Lohman, G.M., Storm, A.J., Garcia-Arellano, C., Fadden, S.: DB2 design advisor: integrated automatic physical database design. In: Proceedings of the 30th International Conference on Very Large Data Bases, pp. 1087–1097 (2004)
79. Zou, L., Mo, J., Zhao, D., Chen, L., Özsu, M.T.: gStore: answering SPARQL queries via subgraph matching. Proc. VLDB Endow. 4(1), 482–493 (2011)

Author information

Authors and Affiliations

SAP Labs, Waterloo, ON, Canada
Güneş Aluç
University of Waterloo, Waterloo, ON, Canada
M. Tamer Özsu & Khuzaima Daudjee

Corresponding author

Correspondence to Güneş Aluç.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Aluç, G., Özsu, M.T. & Daudjee, K. Building self-clustering RDF databases using Tunable-LSH. The VLDB Journal 28, 173–195 (2019). https://doi.org/10.1007/s00778-018-0530-9

Received: 14 October 2017
Revised: 16 August 2018
Accepted: 23 November 2018
Published: 03 December 2018
Issue Date: 11 April 2019
DOI: https://doi.org/10.1007/s00778-018-0530-9
