Skip to main content

A Worst-Case Optimal Join Algorithm for SPARQL

  • Conference paper
  • First Online:
The Semantic Web – ISWC 2019 (ISWC 2019)

Part of the book series: Lecture Notes in Computer Science ((LNISA,volume 11778))

Included in the following conference series:

Abstract

Worst-case optimal multiway join algorithms have recently gained a lot of attention in the database literature. These algorithms not only offer strong theoretical guarantees of efficiency, but have also been empirically demonstrated to significantly improve query runtimes for relational and graph databases. Despite these promising theoretical and practical results, however, the Semantic Web community has yet to adopt such techniques; to the best of our knowledge, no native RDF database currently supports such join algorithms, where in this paper we demonstrate that this should change. We propose a novel procedure for evaluating SPARQL queries based on an existing worst-case join algorithm called Leapfrog Triejoin. We propose an adaptation of this algorithm for evaluating SPARQL queries, and implement it in Apache Jena. We then present experiments over the Berlin and WatDiv SPARQL benchmarks, and a novel benchmark that we propose based on Wikidata that is designed to provide insights into join performance for a more diverse set of basic graph patterns. Our results show that with this new join algorithm, Apache Jena often runs orders of magnitude faster than the base version and two other SPARQL engines: Virtuoso and Blazegraph.

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Chapter
USD 29.95
Price excludes VAT (USA)
  • Available as PDF
  • Read on any device
  • Instant download
  • Own it forever
eBook
USD 39.99
Price excludes VAT (USA)
  • Available as EPUB and PDF
  • Read on any device
  • Instant download
  • Own it forever
Softcover Book
USD 54.99
Price excludes VAT (USA)
  • Compact, lightweight edition
  • Dispatched in 3 to 5 business days
  • Free shipping worldwide - see info

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

Notes

  1. 1.

    We use \(\mathbf I \mathbf B \) as a shortcut for \(\mathbf I \cup \mathbf B \), etc.

  2. 2.

    Other features like BIND, VALUES, SERVICE, etc., that generate or extend mappings can be evaluated in the standard way.

  3. 3.

    Following [18], we use the notation (s|?, p|?, o|?) to denote eight forms of triple patterns where, for example, (?, po) refers to the set of triple patterns with variable subject, constant predicate and constant object: \(\mathbf V \times \mathbf I \times \mathbf I \mathbf L \).

  4. 4.

    Such an order would be produced by SPARQL engines in practice if we had a graph with many :father and :mother relations, outnumbering :winner relations.

  5. 5.

    We currently consider querying over a single RDF graph; if we were to consider a complete index on quads in order to support named graphs, the number of required indexes would jump to 24. In such a case, however, practical steps can be taken to reduce the number of indexes where, for example, some such orders will be rarely accessed by real-world queries and can thus be removed.

References

  1. Apache Jena. https://jena.apache.org/. Accessed 30 Dec 2018

  2. Github project. https://gqgh5wfgzt.github.io/benchmark-leapfrog/

  3. SPARQL 1.1 Query Language. https://www.w3.org/TR/sparql11-query/. Accessed 30 Dec 2018

  4. Aberger, C.R., Lamb, A., Tu, S., Nötzli, A., Olukotun, K., Ré, C.: EmptyHeaded: a relational engine for graph processing. ACM Trans. Database Syst. (TODS) 42(4), 20 (2017)

    Article  MathSciNet  Google Scholar 

  5. Abo Khamis, M., Ngo, H.Q., Rudra, A.: FAQ: questions asked frequently. In: Principles of Database Systems (PODS), pp. 13–28. ACM (2016)

    Google Scholar 

  6. Aluç, G., Hartig, O., Özsu, M.T., Daudjee, K.: Diversified stress testing of RDF data management systems. In: Mika, P., et al. (eds.) ISWC 2014. LNCS, vol. 8796, pp. 197–212. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-11964-9_13

    Chapter  Google Scholar 

  7. Angles, R., Gutierrez, C.: The multiset semantics of SPARQL patterns. In: Groth, P., et al. (eds.) ISWC 2016. LNCS, vol. 9981, pp. 20–36. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46523-4_2

    Chapter  Google Scholar 

  8. Buil-Aranda, C., Hogan, A., Umbrich, J., Vandenbussche, P.-Y.: SPARQL web-querying infrastructure: ready for action? ISWC 2013. LNCS, vol. 8219, pp. 277–293. Springer, Heidelberg (2013). https://doi.org/10.1007/978-3-642-41338-4_18

    Chapter  Google Scholar 

  9. Aref, M.: Design and implementation of the LogicBlox system. In: SIGMOD International Conference on Management of Data, pp. 1371–1382. ACM (2015)

    Google Scholar 

  10. Atre, M., Chaoji, V., Zaki, M.J., Hendler, J.A.: Matrix “Bit” loaded: a scalable lightweight join query processor for RDF data. In: World Wide Web (WWW), pp. 41–50 (2010)

    Google Scholar 

  11. Atserias, A., Grohe, M., Marx, D.: Size bounds and query plans for relational joins. In: Foundations of Computer Science (FOCS), pp. 739–748. IEEE (2008)

    Google Scholar 

  12. Barbay, J., Kenyon, C.: Adaptive intersection and t-threshold problems. In: Symposium on Discrete Algorithms (SODA), pp. 390–399. Society for Industrial and Applied Mathematics (2002)

    Google Scholar 

  13. Bizer, C., Schultz, A.: The Berlin SPARQL benchmark. Int. J. Semant. Web Inf. Syst. (IJSWIS) 5(2), 1–24 (2009)

    Article  Google Scholar 

  14. Bonifati, A., Martens, W., Timm, T.: An analytical study of large SPARQL query logs. PVLDB 11(2), 149–161 (2017)

    Google Scholar 

  15. Demaine, E.D., López-Ortiz, A., Munro, J.I.: Adaptive set intersections, unions, and differences. In: Symposium on Discrete Algorithms (SODA). Citeseer (2000)

    Google Scholar 

  16. Erling, O., Mikhailov, I.: RDF support in the virtuoso DBMS. In: Pellegrini, T., Auer, S., Tochtermann, K., Schaffert, S. (eds.) Networked Knowledge - Networked Media. Studies in Computational Intelligence, vol. 221, pp. 7–24. Springer, Heidelberg (2009). https://doi.org/10.1007/978-3-642-02184-8_2

    Chapter  Google Scholar 

  17. Galkin, M., Endris, K.M., Acosta, M., Collarana, D., Vidal, M., Auer, S.: SMJoin: a multi-way join operator for SPARQL queries. In: International Conference on Semantic Systems (SEMANTICS), pp. 104–111 (2017)

    Google Scholar 

  18. Harth, A., Decker, S.: Optimized index structures for querying RDF from the Web. In: Latin American Web Congress (LA-Web 2005), pp. 71–80 (2005)

    Google Scholar 

  19. Kalinsky, O., Etsion, Y., Kimelfeld B.: Flexible caching in Trie joins. In: International Conference on Extending Database Technology (EDBT), pp. 282–293. Springer (2017). https://doi.org/10.5441/002/edbt.2017.26

  20. Kalinsky, O., Mishali, O., Hogan, A., Etsion, Y., Kimelfeld, B.: Efficiently charting RDF. CoRR, abs/1811.10955 (2018)

    Google Scholar 

  21. Khamis, M.A., Ngo, H.Q., Ré, C., Rudra, A.: Joins via geometric resolutions: worst case and beyond. ACM Trans. Database Syst. (TODS) 41(4), 22 (2016)

    Article  MathSciNet  Google Scholar 

  22. Malyshev, S., Krötzsch, M., González, L., Gonsior, J., Bielefeldt, A.: Getting the most out of Wikidata: semantic technology usage in Wikipedia’s knowledge graph. In: Vrandečić, D., et al. (eds.) ISWC 2018. LNCS, vol. 11137, pp. 376–394. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-00668-6_23

    Chapter  Google Scholar 

  23. Neumann, T., Weikum, G.: RDF-3X: a RISC-style engine for RDF. PVLDB 1(1), 647–659 (2008)

    Google Scholar 

  24. Ngo, H.Q.: Worst-case optimal join algorithms: techniques, results, and open problems. In: Principles of Database Systems (PODS), pp. 111–124. ACM (2018)

    Google Scholar 

  25. Ngo, H.Q., Nguyen, D.T., Re, C., Rudra, A.: Beyond worst-case analysis for joins with minesweeper. In: Principles of Database Systems (PODS), pp. 234–245. ACM (2014)

    Google Scholar 

  26. Ngo, H.Q., Porat, E., Ré, C., Rudra, A.: Worst-case optimal join algorithms. In: Principles of Database Systems (PODS), pp. 37–48. ACM (2012)

    Google Scholar 

  27. Ngo, H.Q., Ré, C., Rudra, A.: Skew strikes back: new developments in the theory of join algorithms. arXiv preprint arXiv:1310.3314 (2013)

  28. Nguyen, D., et al.: Join processing for graph patterns: an old dog with new tricks. In: GRADES, p. 2. ACM (2015)

    Google Scholar 

  29. Ramakrishnan, R., Gehrke, J.: Database Management Systems. McGraw Hill, New York (2000)

    MATH  Google Scholar 

  30. Thompson, B.B., Personick, M., Cutcher, M.: The Bigdata®RDF graph database. In: Linked Data Management, pp. 193–237 (2014)

    Google Scholar 

  31. Veldhuizen, T.L.: Leapfrog Triejoin: a simple, worst-case optimal join algorithm. In: ICDT, pp. 96–106 (2014)

    Google Scholar 

  32. Vrandečić, D., Krötzsch, M.: Wikidata: a free collaborative knowledgebase. Commun. ACM 57, 78–85 (2014)

    Article  Google Scholar 

  33. Weiss, C., Karras, P., Bernstein, A.: Hexastore: sextuple indexing for semantic web data management. PVLDB 1(1), 1008–1019 (2008)

    Google Scholar 

Download references

Acknowledgements

This work was supported by the Millennium Institute for Foundational Research on Data (IMFD) and by Fondecyt Grant No. 1181896.

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Aidan Hogan .

Editor information

Editors and Affiliations

A Proof of Theorem 1

A Proof of Theorem 1

Fix a basic graph pattern P, an RDF graph G, and \(\mathtt {?x}_1, \ldots , \mathtt {?x}_n\) the chosen variable order. Further assume that the computation of \({\text {LF}}_G(P',\mathtt {?x})\) takes time at most \(\min _{t \in P' : \mathtt {?x}\in {\text {var}}(t)} |\pi _{\mathtt {?x}}(\llbracket t \rrbracket _G)| \, \cdot \, \log (|G|)\) for every basic graph pattern \(P'\) (for simplicity, we will omit the trivial empty case where there exists \(t \in P'\) such that \(\mathtt {?x}\in {\text {var}}(t)\) and \(|\pi _{\mathtt {?x}}(\llbracket t \rrbracket _G)| = 0\) since the time taken will be simply \(\log (|G|)\)). Finally, for every RDF graph \(G'\) we will say that \(G'\) is of size less than G whenever \(|\llbracket t \rrbracket _{G_i}| \le |\llbracket t \rrbracket _{G}|\) for all \(t \in P\) (recall that P is fixed).

The proof of Theorem 1 goes in two steps. First, we will bound the time of Leapfrog Join by bounding the time \(T_i\) of each for-loop \(\mathtt {?x}_i\) of Algorithm 1. Then for each level \(\mathtt {?x}_i\) we define a new RDF graph \(G_{i}\) of size less than G such that \(T_i = |\llbracket P \rrbracket _{G_i}| \le 2^{\text {MIN}(P, G)}\). The proof will follow by taking the sum over all \(T_i\).

Fix a variable \(\mathtt {?x}_i\) and denote by \(\bar{x}_{i-1} = \mathtt {?x}_1, \ldots , \mathtt {?x}_{i-1}\) the order of variables before \(\mathtt {?x}_i\) (for the sake of simplification, in the sequel we consider \(\bar{x}_i\) also as a set). We start by bounding the time of the for-loop in Algorithm 1 corresponding to \(\mathtt {?x}_i\). For this, consider the following extension of \({\text {LF}}_G\) over \(\bar{x}_{i-1}\):

$$ {\text {LF}}_G(P,\bar{x}_{i-1}) \ = \ \{\mu \mid {\text {dom}}(\mu ) = \bar{x}_{i-1} \text { and } \llbracket \mu (t) \rrbracket _G \ne \varnothing \text { for all } t \in P\} $$

Clearly, the number of times that the for-loop of \(\mathtt {?x}_i\) will be called is given by \(|{\text {LF}}_G(P,\bar{x}_{i-1})|\). Then for each \(\mu \in {\text {LF}}_G(P,\bar{x}_{i-1})\) the Leapfrog procedure \({\text {LF}}_G(\mu (P),\mathtt {?x}_i)\) is called taking time at most \(\min _{t \in \mu (P) : \mathtt {?x}_i \in {\text {var}}(t)} |\pi _{\mathtt {?x}_i}(\llbracket t \rrbracket _G)|\) (omitting the \(\log (|G|)\) factor for the moment). If we call \(T_i\) the number of steps that Algorithm 1 spends in the for-loop of \(\mathtt {?x}_i\), we have that:

$$ T_i \ = \ \sum _{\mu \in {\text {LF}}_G(P,\bar{x}_{i-1})} \min _{t \in \mu (P) : \mathtt {?x}_i \in {\text {var}}(t)} |\pi _{\mathtt {?x}_i}(\llbracket t \rrbracket _G)| $$

One can easily check that the total time of Algorithm 1 is given by \((\sum _{i=1}^n T_i)\cdot \log (G)\). Therefore, if we bound \(T_i\) by the AGM bound of P and G, then the worst case optimality of Leapfrog Join will be proven (recall that our analysis is in data complexity, omitting factors that depend on the size of P).

To bound the size of \(T_i\), we build an RDF graph \(G_i\) such that \(T_i = |\llbracket P \rrbracket _{G_i}|\) and the size of \(G_i\) is less that the size of G. Let \(\bot \) be a dummy value. To build \(G_i\) define the set of mappings \(U_i\) such that \(\mu \in U_i\) if and only if there exists \(\mu ' \in {\text {LF}}_G(P,\bar{x}_{i-1})\) such that:

  1. 1.

    ep2mm

  2. 1.

    \(\mu (\mathtt {?x}) = \mu '(\mathtt {?x})\) for every \(\mathtt {?x}\in \bar{x}_{i-1}\),

  3. 2.

    \(1 \le \mu (\mathtt {?x}_i) \le \min _{t \in \mu '(P) : \mathtt {?x}_i \in {\text {var}}(t)} |\pi _{\mathtt {?x}_i}(\llbracket t \rrbracket _G)|\), and

  4. 3.

    \(\mu (\mathtt {?x}) = \bot \) for every \(\mathtt {?x}\in \{\mathtt {?x}_{i+1}, \ldots , \mathtt {?x}_{n}\}\).

In other words, \(U_i\) contains all mappings built from mappings of \({\text {LF}}_G(P,\bar{x}_{i-1})\) and extended by assigning to \(\mathtt {?x}_i\) any value less than the time for computing \({\text {LF}}_G(\mu '(P),\mathtt {?x}_i)\). From \(U_i\) we can build the RDF graph \(G_i\) as follows:

$$ G_i \ = \ \bigcup _{\mu \in U_i} \mu (P). $$

By construction, note that the size of \(G_i\) is less than the size of G. Furthermore, we have that \(T_i = |\llbracket P \rrbracket _{G_i}|\). Indeed, for each \(\mu ' \in {\text {LF}}_G(P,\bar{x}_{i-1})\) we will have \(\big (\min _{t \in \mu '(P) : \mathtt {?x}_i \in {\text {var}}(t)} |\pi _{\mathtt {?x}_i}(\llbracket t \rrbracket _G)|\big )\) different mappings in \(\llbracket P \rrbracket _{G_i}\) and vice versa.

To finish the proof, recall the linear program associated to P and G, and its minimum value \(\text {MIN}(P,G)\). Consider also the same linear program but now for P and \(G_i\). Given that \(G_i\) is of size less than G, then the minimization function associated to the linear program of P and \(G_i\) always satisfies:

$$ \sum _{t\in P} x_t \cdot \log (|\llbracket t \rrbracket _{G_i}|) \le \sum _{t\in P} x_t \cdot \log (|\llbracket t \rrbracket _{G}|). $$

Therefore, we can conclude that \(\text {MIN}(P,G_i) \le \text {MIN}(P,G)\) and thus:

$$ T_i \ = \ |\llbracket P \rrbracket _{G_i}| \ \le \ 2^{\text {MIN}(P,G_i)} \ \le \ 2^{\text {MIN}(P,G)} $$

where the second inequality follows by the AGM bound. Given that each \(T_i\) is bounded by \(2^{\text {MIN}(P,G)}\) we conclude that the overall time is bounded by \(n\cdot 2^{\text {MIN}(P,G)} \cdot \log (G)\) and that Leapfrog Join is worst-case optimal.

Rights and permissions

Reprints and permissions

Copyright information

© 2019 Springer Nature Switzerland AG

About this paper

Check for updates. Verify currency and authenticity via CrossMark

Cite this paper

Hogan, A., Riveros, C., Rojas, C., Soto, A. (2019). A Worst-Case Optimal Join Algorithm for SPARQL. In: Ghidini, C., et al. The Semantic Web – ISWC 2019. ISWC 2019. Lecture Notes in Computer Science(), vol 11778. Springer, Cham. https://doi.org/10.1007/978-3-030-30793-6_15

Download citation

  • DOI: https://doi.org/10.1007/978-3-030-30793-6_15

  • Published:

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-30792-9

  • Online ISBN: 978-3-030-30793-6

  • eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics