Workload-Aware Self-tuning Histograms for the Semantic Web

Zamani, Katerina; Charalambidis, Angelos; Konstantopoulos, Stasinos; Zoulis, Nickolas; Mavroudi, Effrosyni

doi:10.1007/978-3-662-53455-7_6

Part of the book series: Lecture Notes in Computer Science (TLDKS, volume 9940)

Abstract

Query processing systems typically rely on histograms, data structures that approximate the distribution of the data, in order to optimize query execution. Histograms can be constructed by scanning the database tables and aggregating the values of the attributes in each table or, more efficiently, progressively refined by analysing query results. Most of the relevant literature focuses on histograms of numerical data, exploiting the natural concept of a numerical range as an estimator of the volume of data that falls within the range. This, however, leaves Semantic Web data outside the scope of the histograms literature, as its most prominent datatype, the URI, does not lend itself to defining such ranges. This article first establishes a framework that formalises histograms over arbitrary data types and provides a formalism for specifying value ranges for different datatypes. This makes explicit the properties that ranges are required to have so that histogram refinement algorithms are applicable. We demonstrate that our framework subsumes histograms over numerical data as a special case by using it to formulate the state of the art in numerical histograms. We then use the Jaro-Winkler metric to define URI ranges by exploiting the hierarchical nature of URI strings. This greatly extends the state of the art, where strings are treated as categorical data that can only be described by enumeration. We then present the open-source STRHist system that implements these ideas. Finally, we present empirical evaluation results using STRHist over a real dataset and query workload extracted from AGRIS, the most popular and widely used bibliographic database on agricultural research and technology.
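To make the idea of a similarity-based URI range more concrete, the following is a minimal sketch in Python. The `jaro` and `jaro_winkler` functions follow the standard textbook definition of the metric (prefix scale 0.1, prefix length capped at 4). The `SimilarityRange` class, its `center` and `min_similarity` fields, and the example URIs are hypothetical illustrations only; the abstract does not specify how STRHist actually represents ranges, so this should not be read as the system's API or the authors' exact formulation.

```python
from dataclasses import dataclass


def jaro(s: str, t: str) -> float:
    """Standard Jaro similarity between two strings, in [0, 1]."""
    if s == t:
        return 1.0
    if not s or not t:
        return 0.0
    # Characters match only if they are equal and close enough in position.
    window = max(max(len(s), len(t)) // 2 - 1, 0)
    s_hit = [False] * len(s)
    t_hit = [False] * len(t)
    matches = 0
    for i, ch in enumerate(s):
        lo = max(0, i - window)
        hi = min(len(t), i + window + 1)
        for j in range(lo, hi):
            if not t_hit[j] and t[j] == ch:
                s_hit[i] = t_hit[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Transpositions: half the number of matched characters that are out of order.
    s_seq = [c for c, hit in zip(s, s_hit) if hit]
    t_seq = [c for c, hit in zip(t, t_hit) if hit]
    transpositions = sum(a != b for a, b in zip(s_seq, t_seq)) // 2
    return (matches / len(s) + matches / len(t)
            + (matches - transpositions) / matches) / 3


def jaro_winkler(s: str, t: str, prefix_scale: float = 0.1) -> float:
    """Jaro similarity boosted for strings that share a common prefix."""
    base = jaro(s, t)
    prefix = 0
    for a, b in zip(s[:4], t[:4]):
        if a != b:
            break
        prefix += 1
    return base + prefix * prefix_scale * (1.0 - base)


@dataclass(frozen=True)
class SimilarityRange:
    """Hypothetical URI range: every string that is at least
    `min_similarity` similar to `center` (an assumption of this sketch)."""
    center: str
    min_similarity: float

    def contains(self, uri: str) -> bool:
        return jaro_winkler(self.center, uri) >= self.min_similarity


# Illustrative URIs, not taken from the AGRIS dataset.
bucket_range = SimilarityRange("http://example.org/resource/A1", 0.95)
print(bucket_range.contains("http://example.org/resource/A2"))
print(bucket_range.contains("ftp://zzz"))
```

Because Jaro-Winkler rewards shared prefixes, URIs under the same namespace or path score close to each other, which is what allows a single range to stand for many URIs instead of enumerating them.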

Notes

1. STRHist is available at https://github.com/semagrow/strhist. For more details on Semagrow, please see http://semagrow.github.io.
2. Please see http://agris.fao.org for more details on AGRIS. The AGRIS site mentions 7 million distinct publications, but this includes recent additions that are not in the end-2013 data dump used for these experiments.
3. We use the canonical string representation of URIs as defined in Sect. 2, IETF RFC 7320 (http://tools.ietf.org/html/rfc7320).

Acknowledgements

The research leading to these results has received funding from the European Union’s Seventh Framework Programme (FP7/2007-2013) under grant agreement No. 318497. For more details about the SemaGrow project please see http://www.semagrow.eu and about the Semagrow system please see http://semagrow.github.io.

Author information

Authors and Affiliations

Institute of Informatics and Telecommunications, NCSR ‘Demokritos’, Athens, Greece
Katerina Zamani, Angelos Charalambidis, Stasinos Konstantopoulos & Nickolas Zoulis
Computer Science Department, Athens University of Economics and Business, Athens, Greece
Nickolas Zoulis
School of Electrical and Computer Engineering, National Technical University of Athens, Athens, Greece
Effrosyni Mavroudi

Corresponding author

Correspondence to Angelos Charalambidis.

Editor information

Editors and Affiliations

IRIT, Paul Sabatier University, Toulouse, France
Abdelkader Hameurlain
FAW, University of Linz, Linz, Austria
Josef Küng
FAW, University of Linz, Linz, Austria
Roland Wagner
HP Labs, Sunnyvale, California, USA
Qimin Chen

About this chapter

Cite this chapter

Zamani, K., Charalambidis, A., Konstantopoulos, S., Zoulis, N., Mavroudi, E. (2016). Workload-Aware Self-tuning Histograms for the Semantic Web. In: Hameurlain, A., Küng, J., Wagner, R., Chen, Q. (eds) Transactions on Large-Scale Data- and Knowledge-Centered Systems XXVIII. Lecture Notes in Computer Science, vol 9940. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-662-53455-7_6

DOI: https://doi.org/10.1007/978-3-662-53455-7_6
Published: 10 September 2016
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-662-53454-0
Online ISBN: 978-3-662-53455-7
eBook Packages: Computer Science (R0)
