A More General Theory of Static Approximations for Conjunctive Queries

Abstract

Conjunctive query (CQ) evaluation is NP-complete, but becomes tractable for fragments of bounded hypertreewidth. Approximating a hard CQ by a query from such a fragment can thus allow for an efficient approximate evaluation. While underapproximations (i.e., approximations that return correct answers only) are well understood, the dual notion of overapproximations (i.e., approximations that return complete, but not necessarily sound, answers), and also a more general notion of approximation based on the symmetric difference of query results, are almost unexplored. In fact, the decidability of the basic problems of evaluation, identification, and existence of those approximations has been open. This article establishes a connection between overapproximations and existential pebble games that allows for studying such problems systematically. Building on this connection, it is shown that the evaluation and identification problems for overapproximations can be solved in polynomial time. While the general existence problem remains open, the problem is shown to be decidable in 2EXPTIME over the class of acyclic CQs and in PTIME for Boolean CQs over binary schemata. Additionally, we propose a more liberal notion of overapproximations to remedy the known shortcoming that queries might not have an overapproximation, and study how queries can be overapproximated in the presence of tuple generating and equality generating dependencies. The techniques are then extended to symmetric difference approximations and used to provide several complexity results for the identification, existence, and evaluation problems for this type of approximation.


Notes

  1. Recall that the symmetric difference between sets A and B is (A ∖ B) ∪ (B ∖ A).

References

  1. Abiteboul, S., Hull, R., Vianu, V.: Foundations of Databases. Addison-Wesley, Boston (1995)
  2. Baader, F., Calvanese, D., McGuinness, D.L., Nardi, D., Patel-Schneider, P.F. (eds.): The Description Logic Handbook: Theory, Implementation, and Applications. Cambridge University Press, Cambridge (2003)
  3. Bárány, V., Gottlob, G., Otto, M.: Querying the guarded fragment. Logical Methods in Computer Science 10(2) (2014)
  4. Barceló, P.: Querying graph databases. In: PODS, pp. 175–188 (2013)
  5. Barceló, P., Gottlob, G., Pieris, A.: Semantic acyclicity under constraints. In: PODS, pp. 343–354 (2016)
  6. Barceló, P., Libkin, L., Romero, M.: Efficient approximations of conjunctive queries. In: PODS, pp. 249–260 (2012)
  7. Barceló, P., Libkin, L., Romero, M.: Efficient approximations of conjunctive queries. SIAM J. Comput. 43(3), 1085–1130 (2014)
  8. Barceló, P., Romero, M., Vardi, M.Y.: Semantic acyclicity on graph databases. SIAM J. Comput. 45(4), 1339–1376 (2016)
  9. Blumensath, A., Otto, M., Weyer, M.: Decidability results for the boundedness problem. Logical Methods in Computer Science 10(3) (2014)
  10. Calì, A., Gottlob, G., Kifer, M.: Taming the infinite chase: Query answering under expressive relational constraints. In: KR, pp. 70–80 (2008)
  11. Chandra, A.K., Merlin, P.M.: Optimal implementation of conjunctive queries in relational data bases. In: STOC, pp. 77–90 (1977)
  12. Chekuri, C., Rajaraman, A.: Conjunctive query containment revisited. Theor. Comput. Sci. 239(2), 211–229 (2000)
  13. Chen, H., Dalmau, V.: Beyond hypertree width: decomposition methods without decompositions. In: CP, pp. 167–181 (2005)
  14. Cosmadakis, S.S., Gaifman, H., Kanellakis, P.C., Vardi, M.Y.: Decidable optimization problems for database logic programs (Preliminary Report). In: STOC, pp. 477–490 (1988)
  15. Dalmau, V., Kolaitis, P.G., Vardi, M.Y.: Constraint satisfaction, bounded treewidth, and finite-variable logics. In: CP, pp. 310–326 (2002)
  16. Deutsch, A., Nash, A., Remmel, J.B.: The chase revisited. In: PODS, pp. 149–158 (2008)
  17. Fagin, R.: A normal form for relational databases that is based on domains and keys. ACM Trans. Database Syst. 6(3), 387–415 (1981)
  18. Fagin, R., Kolaitis, P.G., Miller, R.J., Popa, L.: Data exchange: Semantics and query answering. Theor. Comput. Sci. 336(1), 89–124 (2005)
  19. Fan, W., Li, J., Ma, S., Tang, N., Wu, Y., Wu, Y.: Graph pattern matching: From intractable to polynomial time. PVLDB 3(1), 264–275 (2010)
  20. Fink, R., Olteanu, D.: On the optimal approximation of queries using tractable propositional languages. In: ICDT, pp. 174–185 (2011)
  21. Fischl, W., Gottlob, G., Pichler, R.: General and fractional hypertree decompositions: hard and easy cases. In: PODS, pp. 17–32 (2018)
  22. Gaifman, H., Mairson, H.G., Sagiv, Y., Vardi, M.Y.: Undecidable optimization problems for database logic programs. J. ACM 40(3), 683–713 (1993)
  23. Garofalakis, M., Gibbons, P.B.: Approximate query processing: taming the terabytes. In: VLDB, p. 725 (2001)
  24. Gottlob, G., Greco, G., Leone, N., Scarcello, F.: Hypertree decompositions: questions and answers. In: PODS, pp. 57–74 (2016)
  25. Gottlob, G., Leone, N., Scarcello, F.: Hypertree decompositions and tractable queries. J. Comput. Syst. Sci. 64(3), 579–627 (2002)
  26. Gottlob, G., Miklós, Z., Schwentick, T.: Generalized hypertree decompositions: NP-hardness and tractable variants. J. ACM 56(6), 30:1–30:32 (2009)
  27. Greco, G., Scarcello, F.: The power of local consistency in conjunctive queries and constraint satisfaction problems. SIAM J. Comput. 46(3), 1111–1145 (2017)
  28. Grohe, M., Marx, D.: Constraint solving via fractional edge covers. In: SODA, pp. 289–298 (2006)
  29. Hell, P., Nešetřil, J.: The core of a graph. Discret. Math. 109(1-3), 117–126 (1992)
  30. Hell, P., Nešetřil, J., Zhu, X.: Complexity of tree homomorphisms. Discret. Appl. Math. 70(1), 23–36 (1996)
  31. Hell, P., Nešetřil, J.: Graphs and Homomorphisms. Oxford University Press, Oxford (2004)
  32. Ioannidis, Y.: Approximations in database systems. In: ICDT, pp. 16–30 (2003)
  33. Kolaitis, P.G., Panttaja, J.: On the complexity of existential pebble games. In: CSL, pp. 314–329 (2003)
  34. Kolaitis, P.G., Vardi, M.Y.: On the expressive power of datalog: Tools and a case study. J. Comput. Syst. Sci. 51(1), 110–134 (1995)
  35. Kolaitis, P.G., Vardi, M.Y.: Conjunctive-query containment and constraint satisfaction. J. Comput. Syst. Sci. 61(2), 302–332 (2000)
  36. Liu, Q.: Approximate query processing. In: Encyclopedia of Database Systems, pp. 113–119 (2009)
  37. Maier, D., Mendelzon, A.O., Sagiv, Y.: Testing implications of data dependencies. ACM Trans. Database Syst. 4(4), 455–469 (1979)
  38. Otto, M.: The boundedness problem for monadic universal first-order logic. In: LICS, pp. 37–48 (2006)
  39. Papadimitriou, C.H., Yannakakis, M.: On the complexity of database queries. J. Comput. Syst. Sci. 58(3), 407–427 (1999)
  40. Yannakakis, M.: Algorithms for acyclic database schemes. In: VLDB, pp. 82–94 (1981)

Author information

Corresponding author

Correspondence to Pablo Barceló.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This article is part of the Topical Collection on Special Issue on Database Theory (2018)

Barceló is funded by Millennium Institute for Foundational Research on Data and Fondecyt Grant 1170109. Zeume acknowledges the financial support by the European Research Council (ERC), grant agreement No 683080. Romero and Zeume thank the Simons Institute for the Theory of Computing for hosting them. This project has received funding from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme (grant agreement No 714532). The paper reflects only the authors’ views and not the views of the ERC or the European Commission. The European Union is not liable for any use that may be made of the information contained therein.

Appendix

Proof

(Theorem 1) Fix \(k > 1\). The CQ \(q\) is defined over graphs, i.e., over a schema with a single binary relation symbol \(E\), and consists of \(k+1\) variables \(v_{1},\dots,v_{k+1}\). For every \(1 \leq i < j \leq k+1\) we add either the atom (i.e., edge) \(E(v_{i},v_{j})\) or \(E(v_{j},v_{i})\) to \(q\) in such a way that, in the resulting graph \(G\), the subgraph induced by \(\{v_{1},v_{2},v_{3}\}\) is a directed cycle and a certain condition (‡), defined below, holds. We start by introducing some terminology.

Let \(G\) be a directed graph on nodes \(v_{1},\dots,v_{k+1}\) that contains, for each \(1 \leq i < j \leq k+1\), either the edge \(E(v_{i},v_{j})\) or \(E(v_{j},v_{i})\). For a set \(B \subseteq \{v_{1},\dots,v_{k+1}\}\) of size \(1 \leq \ell \leq k-1\) and a node \(v \in \{v_{1},\dots,v_{k+1}\} \setminus B\), we define \(\textsf{conn}(v,B)\) as the tuple \((e_{1},\dots,e_{k+1}) \in \{-1,1,\#\}^{k+1}\) such that for each \(1 \leq p \leq k+1\):

$$ e_{p} \ = \ \left\{\begin{array}{lll} \#, & \text{if } v_p \not\in B, \\ 1, & \text{if } v_p \in B \text{ and the edge } E(v,v_p) \text{ is in } G, \\ -1, & \text{otherwise, i.e., } v_p \in B \text{ and } E(v_p,v) \text{ is in } G. \end{array}\right. $$

In simple terms, conn(v,B) specifies how v connects with the nodes in B.

Our condition (‡) then establishes the following:

  (‡) For each \(B \subseteq \{v_{1},\dots,v_{k+1}\}\) of size \(2 \leq \ell \leq k-1\) and each node \(v\) in \(\{v_{1},\dots,v_{k+1}\} \setminus B\), there is a node \(v^{\prime} \in \{v_{1},\dots,v_{k+1}\} \setminus B\) such that

    $$ \textsf{conn}(v,B) \quad \neq \quad \textsf{conn}(v^{\prime},B). $$

    That is, for each such \(B\) and \(v\) we will always be able to find another node \(v^{\prime}\) outside \(B\) that connects to the nodes in \(B\) in a different way than \(v\).

Example 6

The graphs in Fig. 6 satisfy this condition for k = 2, 3, 4, respectively. Notice that the directed cycle on nodes {v1,v2,v3}, shown in the left-hand side, satisfies condition (‡) trivially.

Fig. 6. Directed graphs that satisfy condition (‡) for k = 2, 3, 4, respectively

The next lemma establishes that for each k > 1 there is always a graph that satisfies this condition.

Lemma 9

For each \(k > 1\), there is a directed graph \(G\) on nodes \(v_{1},\dots,v_{k+1}\) such that the following hold:

  1. For each \(1 \leq i < j \leq k+1\), either the edge \(E(v_{i},v_{j})\) or \(E(v_{j},v_{i})\) is in \(G\);

  2. the subgraph of \(G\) induced by \(\{v_{1},v_{2},v_{3}\}\) is a directed cycle; and

  3. \(G\) satisfies condition (‡).

Proof

(Lemma 9) For \(k = 2\) this is given by the graph in Example 6. For \(k \geq 3\) we prove by induction a stronger claim: There is a directed graph \(G\) on nodes \(v_{1},\dots,v_{k+1}\) such that:

  1. \(G\) contains either the edge \(E(v_{i},v_{j})\) or \(E(v_{j},v_{i})\) for each \(1 \leq i < j \leq k+1\).

  2. The subgraph of \(G\) induced by \(\{v_{1},v_{2},v_{3}\}\) is a directed cycle.

  3. \(G\) contains the edges \(E(v_{1},v_{2})\) and \(E(v_{4},v_{3})\).

  4. \(G\) satisfies condition (‡).

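The base case \(k = 3\) is given again by the graph in Example 6. For the inductive case, assume by induction hypothesis that there is a directed graph \(G\) on nodes \(v_{1},\dots,v_{k+1}\) that satisfies the claim above. A new graph \(G^{\prime}\) is then created from \(G\) by adding a new node \(v_{k+2}\) and connecting it to the nodes in \(\{v_{1},\dots,v_{k+1}\}\) as follows: For each \(1 \leq i \leq k\), if \(E(v_{i},v_{i+1})\) is in \(G\) then we add the edge \(E(v_{k+2},v_{i})\) to \(G^{\prime}\), otherwise we add the edge \(E(v_{i},v_{k+2})\). Moreover, if \(E(v_{k+1},v_{1})\) is in \(G\) then we add the edge \(E(v_{k+2},v_{k+1})\) to \(G^{\prime}\), otherwise we add the edge \(E(v_{k+1},v_{k+2})\). Notice that \(G\) coincides with the subgraph of \(G^{\prime}\) that is induced by nodes \(v_{1},\dots,v_{k+1}\). Moreover, by construction \(G^{\prime}\) satisfies the first three conditions of the claim. We prove next that it also satisfies condition (‡).

Take an arbitrary \(B \subseteq \{v_{1},\dots,v_{k+2}\}\) of size \(2 \leq \ell \leq k\) and a node \(v\) outside \(B\). We prove that the condition holds by cases:

  • \(v_{k+2} \notin B\), \(v \in \{v_{1},\dots,v_{k+1}\}\), and \(2 \leq \ell \leq k-1\): By inductive hypothesis there is a node \(v^{\prime} \in \{v_{1},\dots,v_{k+1}\} \setminus B\) such that \(\textsf{conn}(v,B) \neq \textsf{conn}(v^{\prime},B)\).

  • \(v_{k+2} \notin B\), \(v \in \{v_{1},\dots,v_{k+1}\}\), and \(\ell = k\): We set \(v^{\prime} := v_{k+2}\) and claim that the predecessor \(u\) of \(v\) in \(\{v_{1},\dots,v_{k+1}\}\) distinguishes \(v\) and \(v^{\prime}\). Here, the “predecessor” of \(v_{i}\) is \(v_{i-1}\) if \(2 \leq i \leq k+1\), and the predecessor of \(v_{1}\) is \(v_{k+1}\) (note that \(u \in B\) as \(\ell = k\)). By construction of \(G^{\prime}\), we have that \(E(u,v) \in G^{\prime}\) if and only if \(E(v^{\prime},u) \in G^{\prime}\). We conclude that \(\textsf{conn}(v,B) \neq \textsf{conn}(v^{\prime},B)\).

  • \(v_{k+2} \notin B\) and \(v = v_{k+2}\): There must exist some node \(v^{\prime}\) in \(\{v_{1},\dots,v_{k+1}\}\) that does not belong to \(B\) but whose predecessor \(u\) in \(\{v_{1},\dots,v_{k+1}\}\) does. Then by construction of \(G^{\prime}\), we have that \(E(u,v^{\prime}) \in G^{\prime}\) if and only if \(E(v,u) \in G^{\prime}\). We conclude that \(\textsf{conn}(v,B) \neq \textsf{conn}(v^{\prime},B)\).

  • \(v_{k+2} \in B\) and \(\ell \geq 3\): Then \(B^{\prime} = B \setminus \{v_{k+2}\}\) is of size \(2 \leq \ell - 1 \leq k-1\). By induction hypothesis, for every node \(v\) outside \(B\) there is another node \(v^{\prime} \in \{v_{1},\dots,v_{k+1}\} \setminus B^{\prime}\) such that \(\textsf{conn}(v,B^{\prime}) \neq \textsf{conn}(v^{\prime},B^{\prime})\). This implies that \(\textsf{conn}(v,B) \neq \textsf{conn}(v^{\prime},B)\).

  • \(v_{k+2} \in B\) and \(\ell = 2\): Then \(B = \{v_{k+2},u\}\) for some \(u \in \{v_{1},\dots,v_{k+1}\}\). Suppose first that \(u \in \{v_{1},v_{2},v_{3}\}\). Since the subgraph induced by \(\{v_{1},v_{2},v_{3}\}\) in \(G^{\prime}\) defines a directed cycle, it is the case that \(E(u,z)\) holds if and only if \(E(w,u)\) holds, where \(\{u,w,z\} = \{v_{1},v_{2},v_{3}\}\). Therefore, for each \(v \in \{v_{1},\dots,v_{k+1}\} \setminus B\) there is a node \(v^{\prime} \in \{z,w\}\) such that \(\textsf{conn}(v,\{u\}) \neq \textsf{conn}(v^{\prime},\{u\})\). It follows that \(\textsf{conn}(v,B) \neq \textsf{conn}(v^{\prime},B)\). Suppose now that \(u \notin \{v_{1},v_{2},v_{3}\}\). It suffices to exhibit two nodes \(v\) and \(v^{\prime}\) outside \(B\) such that \(E(v,v_{k+2})\) and \(E(v_{k+2},v^{\prime})\) are in \(G^{\prime}\). By induction hypothesis the edges \(E(v_{1},v_{2})\) and \(E(v_{4},v_{3})\) are in \(G\). Therefore, \(v_{k+2}\) is connected via edges \(E(v_{3},v_{k+2})\) and \(E(v_{k+2},v_{1})\) in \(G^{\prime}\).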

This concludes the proof of the lemma. □
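The inductive construction in this proof can be replayed mechanically. The following Python sketch is illustrative only and is not part of the article: nodes \(v_{1},\dots,v_{n}\) are encoded as integers 1..n, a graph is a set of ordered pairs, and the function names are hypothetical. Since Fig. 6 is not reproduced here, the base graph `BASE` for \(k = 3\) is an assumption worked out by hand (a directed cycle on nodes 1, 2, 3 plus a node 4 with edges to all three), not necessarily the graph shown in the figure.

```python
from itertools import combinations

def conn(edges, v, B, n):
    """Tuple conn(v, B): '#' for nodes outside B, 1 if E(v, p) is an edge, -1 otherwise."""
    return tuple(
        "#" if p not in B else (1 if (v, p) in edges else -1)
        for p in range(1, n + 1)
    )

def is_tournament(edges, n):
    """Exactly one of E(i, j), E(j, i) is present for every pair i < j."""
    return all(((i, j) in edges) != ((j, i) in edges)
               for i, j in combinations(range(1, n + 1), 2))

def satisfies_condition(edges, n):
    """Brute-force test of condition (‡) on nodes 1..n, where n = k + 1."""
    nodes = range(1, n + 1)
    for size in range(2, n - 1):  # sizes 2, ..., k - 1
        for B in map(set, combinations(nodes, size)):
            outside = [v for v in nodes if v not in B]
            # Every outside node has a differently connected outside node
            # exactly when at least two distinct conn-tuples occur.
            if len({conn(edges, v, B, n) for v in outside}) < 2:
                return False
    return True

def extend(edges, n):
    """Inductive step: add node n + 1, oriented by the edge from i to its cyclic successor."""
    new, out = n + 1, set(edges)
    for i in range(1, n + 1):
        succ = i + 1 if i < n else 1
        out.add((new, i) if (i, succ) in edges else (i, new))
    return out

# Assumed base case for k = 3 (hand-built, see the lead-in).
BASE = {(1, 2), (2, 3), (3, 1), (4, 1), (4, 2), (4, 3)}
```

Starting from `BASE` with `n = 4` and calling `extend` repeatedly yields graphs on 5, 6, 7, ... nodes; `satisfies_condition` can then be used to confirm on small instances that condition (‡) is preserved, as the lemma states.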

Fix \(k \geq 1\). We then take as \(q\) any Boolean CQ whose canonical database is a graph \(G\) on nodes \(v_{1},\dots,v_{2k+1}\) that satisfies the conditions stated in Lemma 9. That is, (1) for each \(1 \leq i < j \leq 2k+1\), either the edge \(E(v_{i},v_{j})\) or \(E(v_{j},v_{i})\) is in \(G\), (2) the subgraph of \(G\) induced by \(\{v_{1},v_{2},v_{3}\}\) is a directed cycle, and (3) \(G\) satisfies condition (‡). It is easy to see that \(q\) is in \(\textsf{GHW}(k+1) \setminus \textsf{GHW}(k)\), as its underlying undirected graph is a clique on \(2k+1\) elements. In fact, these elements can be covered with \(k+1\) edges, but not with \(k\): each edge covers two elements, so \(k\) edges cover at most \(2k\) of them.

We claim that \(q\) has no \(\textsf{GHW}(\ell)\)-overapproximation for any \(1 \leq \ell \leq k\). The proofs for the cases \(\ell = 1\) and \(\ell > 1\) are slightly different. We start with the latter, i.e., when \(1 < \ell \leq k\). The proof for every such \(\ell\) is analogous, and thus we concentrate on proving the claim for \(\ell = k > 1\). According to Theorem 7, we need to prove that there is no constant \(c \geq 0\) such that for every database \(\mathcal {D}\) it holds that

$$ q \to_{k} \mathcal{D} \quad \Longleftrightarrow \quad q {\to_{k}^{c}} \mathcal{D}. $$

It is sufficient to show then that for each integer c ≥ 0 there is a database \(\mathcal {D}\) such that

$$ q {\to_{k}^{c}} \mathcal{D} \ \ \text{ but } \ \ q \not\to_{k}^{c+1} \mathcal{D}. $$

Or, equivalently, that for each integer c ≥ 0 there is a database \(\mathcal {D}\) such that

$$ q_{c} \to \mathcal{D} \ \ \text{ but } \ \ q_{c+1} \not\to \mathcal{D}, $$

where qc, for c ≥ 0, is the CQ which is defined in Lemma 1, i.e., for every \(\mathcal {D}\) it is the case that \(q {\to _{k}^{c}} \mathcal {D}\) iff \(q_{c} \to \mathcal {D}\). In view of (1), this boils down to proving that

$$ q_{c+1} \not\to q_{c}, \quad \text{for each } c \geq 0. \tag{8} $$

We prove (8) by induction. The claim clearly holds for \(c = 0\), as by definition \(q_{0}\) is empty while \(q_{1}\) is not. Let us assume now that the claim holds for \(c \geq 0\), that is, \(q_{c+1} \not\to q_{c}\). This means, in particular, that the core of \(q_{c+1}\) is not contained in \(q_{c}\); that is, this core contains at least one node \(w\) of \(q_{c+1}\) that does not belong to \(q_{c}\).

By the way \(q\) is defined, any \(k\)-union of \(q\) must be of the form \(S \subseteq \{v_{1},\dots,v_{2k+1}\}\) with \(|S| = 2k\). Let us consider now \((T_{c+1},\beta_{c+1})\) as defined in the proof of Lemma 1. Since \(w \notin q_{c}\), it must be the case that there is a unique node \(t\) of \(T_{c+1}\) such that \(w \in \beta_{c+1}(t)\). Moreover, this \(t\) must be a leaf of \(T_{c+1}\). Suppose that \(\phi_{t}(w) = v\), for \(v \in \{v_{1},\dots,v_{2k+1}\}\), where \(\phi_{t}\) is as defined in the proof of Lemma 1, i.e., \(\phi_{t}\) is a bijection between \(\beta_{c+1}(t)\) and the \(k\)-union \(S \subseteq \{v_{1},\dots,v_{2k+1}\}\) of \(q\) such that \(\lambda_{c+1}(t) = S\).

Notice, by definition, that if the parent of \(t\) in \(T_{c+1}\) is \(t^{\prime}\), then either \(\lambda_{c+1}(t^{\prime}) = \emptyset\) (which holds precisely when \(t^{\prime}\) is the root of \(T_{c+1}\)), or \(\lambda_{c+1}(t^{\prime}) = S^{\prime}\), where \(S^{\prime}\) is the subset of \(\{v_{1},\dots,v_{2k+1}\}\) which contains all elements save for \(v\). That is, in the latter case we have that \(S\) is obtained from \(S^{\prime}\) by replacing some element \(v^{\prime}\) in \(\{v_{1},\dots,v_{2k+1}\}\), with \(v^{\prime} \neq v\), by \(v\) itself.

From Proposition 1, we can assume that the homomorphism that maps \(q_{c+1}\) to its core is a retraction, i.e., it is the identity on the nodes of this core, in particular, on \(w\). On the other hand, \(w\) is linked in \(q_{c+1}\) exclusively with the remaining nodes that appear in \(\beta_{c+1}(t)\). Moreover, the graph induced by the nodes in \(\lambda_{c+1}(t)\) is a clique on \(2k\) elements, and thus all the elements in \(\beta_{c+1}(t)\) must belong to the core of \(q_{c+1}\).

Recall that \(\phi_{t}(w) = v\). Take an arbitrary node \(v^{\prime\prime} \in S\) that is not \(v\). Notice that \(v^{\prime\prime} \neq v^{\prime}\) either, as \(v^{\prime\prime} \in S\) while \(v^{\prime} \notin S\). By definition, \(T_{c+2}\) contains a leaf \(t^{\prime\prime}\) whose parent is \(t\) such that \(\lambda_{c+2}(t^{\prime\prime}) = S^{\prime\prime}\), where \(S^{\prime\prime}\) is the subset of \(\{v_{1},\dots,v_{2k+1}\}\) which is obtained from \(S\) by replacing \(v^{\prime\prime}\) with the unique node in \(\{v_{1},\dots,v_{2k+1}\} \setminus S\), namely \(v^{\prime}\). Let us assume that \(\phi _{t^{\prime \prime }}(v^{\prime }) = w^{\prime \prime }\). Notice that \(w^{\prime\prime}\) appears in no other node in \((T_{c+2},\beta_{c+2})\).

Assume now, for the sake of contradiction, that \(q_{c+2} \to q_{c+1}\). Then the core of \(q_{c+2}\) is the same as the core of \(q_{c+1}\). Let \(C\) be this core. Hence, from Proposition 1 there is a retraction \(h\) from \(q_{c+2}\) to \(C\). Since all elements in \(\beta_{c+2}(t) = \beta_{c+1}(t)\) are in \(C\), the homomorphism \(h\) must be the identity on them. But then \(h\) maps \(w^{\prime\prime}\) to the unique element in \(q_{c+1}\) that is linked to exactly the same nodes as \(w^{\prime\prime}\) in \(q_{c+2}\); namely, the element of \(\beta_{c+1}(t)\) that corresponds to \(v^{\prime\prime}\) under \(\phi_{t}\).

Suppose that \(v^{\prime}\) and \(v^{\prime\prime}\) represent the nodes \(v_{i}\) and \(v_{j}\) in \(\{v_{1},\dots,v_{2k+1}\}\), respectively. By assumption, \(i \neq j\). But this implies then that in the canonical database \(G\) of \(q\) we have that

$$ \textsf{conn}(v_{i}, B) \ = \ \textsf{conn}(v_{j}, B), $$

where \(B = \{v_{1},\dots,v_{2k+1}\} \setminus \{v_{i},v_{j}\}\). This is a contradiction since \(B\) is of size \(2k-1 > 1\) and \(G\) satisfies condition (‡). This concludes our proof that \(q\) has no \(\textsf{GHW}(k)\)-overapproximation (and, analogously, that it has no \(\textsf{GHW}(\ell)\)-overapproximation for any \(1 < \ell \leq k\)).

We prove next that \(q\) does not have a \(\textsf{GHW}(1)\)-overapproximation either. Let us assume, for the sake of contradiction, that \(q\) has a \(\textsf{GHW}(1)\)-overapproximation \(q^{\prime}\). It is an easy observation that the directed graphs in \(\textsf{GHW}(1)\) are precisely those whose underlying undirected graph is acyclic. Notice also that \(q^{\prime}\) has no directed cycles of length two (i.e., atoms of the form \(E(u,v)\) and \(E(v,u)\)); otherwise, since \(q \subseteq q^{\prime}\), we would have that \(q\) also has such a cycle (which we know it does not). Using the fact that \(q^{\prime} \in \textsf{GHW}(1)\) and has no directed cycles of length two, it is not difficult to show (see e.g. [31]) that there is a sufficiently large integer \(n \geq 1\) such that, if \(\mathbf{P}_{n}\) is the directed path on \(n\) vertices, then

$$ q^{\prime} \to \mathbf{P}_{n} \ \ \text{ but } \ \ \mathbf{P}_{n} \not\to q^{\prime}. $$

This implies that if \(q^{\prime\prime}\) is the Boolean CQ which is naturally defined by \(\mathbf{P}_{n}\), then \(q^{\prime \prime } \subsetneq q^{\prime }\). Moreover, \(\mathbf{P}_{n} \to G\). This is due to the fact that \(G\) contains a directed cycle on \(\{v_{1},v_{2},v_{3}\}\). We conclude that

$$ q \subseteq q^{\prime\prime} \subsetneq q^{\prime}, $$

and, therefore, that \(q^{\prime}\) is not a \(\textsf{GHW}(1)\)-overapproximation of \(q\). This is a contradiction. We then conclude the proof of Theorem 1. □
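The step from the directed cycle on \(\{v_{1},v_{2},v_{3}\}\) to a homomorphism from the directed path into \(G\) amounts to wrapping the path around the cycle. A minimal Python sketch of that observation follows; the integer encoding of nodes and the function names are assumptions made for illustration only.

```python
def is_homomorphism(h, src_edges, dst_edges):
    """h maps every edge of the source graph to an edge of the target graph."""
    return all((h[u], h[v]) in dst_edges for (u, v) in src_edges)

def path_edges(n):
    """Directed path on vertices 1..n."""
    return {(i, i + 1) for i in range(1, n)}

# Directed cycle on three nodes, standing in for the cycle on v1, v2, v3 in G.
CYCLE = {(1, 2), (2, 3), (3, 1)}

def wrap_around(n):
    """Send path vertex i to cycle node ((i - 1) mod 3) + 1."""
    return {i: (i - 1) % 3 + 1 for i in range(1, n + 1)}
```

For every `n`, `is_homomorphism(wrap_around(n), path_edges(n), CYCLE)` holds, which is why the length of the path plays no role in this part of the argument.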

Proof

(Lemma 8) Before proving the lemma, we need some terminology and claims. Let \(\mathcal {D}\) be a database and \((A_{1},\dots,A_{n})\) be a tuple of pairwise-disjoint subsets of elements of \(\mathcal {D}\), where \(n \geq 0\). In addition, let \(\mathcal {D}^{\prime }\) be a database and \((a_{1},\dots,a_{n})\) a tuple of elements in \(\mathcal {D}^{\prime }\). Then we write \((\mathcal {D},(A_{1},\dots ,A_{n}))\to (\mathcal {D}^{\prime },(a_{1},\dots ,a_{n}))\) iff there is a homomorphism \(h\) from \(\mathcal {D}\) to \(\mathcal {D}^{\prime }\) such that, for each \(i \in \{1,\dots,n\}\) and \(a \in A_{i}\), it is the case that \(h(a) = a_{i}\).
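To make this notion concrete, the following Python sketch searches by brute force for a homomorphism that sends every element of \(A_{i}\) to \(a_{i}\). It is an illustration only: the encoding of a database as a set of `(relation, tuple)` atoms and the function names are assumptions, and the exhaustive search is exponential, so it is suitable for small examples only.

```python
from itertools import product

def elements(db):
    """All elements mentioned in a database given as a set of (relation, args) atoms."""
    return {a for _, args in db for a in args}

def pinned_homomorphism(D, pinned_sets, D2, targets):
    """Return some h witnessing (D, (A_1..A_n)) -> (D2, (a_1..a_n)), or None."""
    # Elements of A_i are forced to a_i; all remaining elements are free.
    fixed = {a: t for A, t in zip(pinned_sets, targets) for a in A}
    free = sorted(elements(D) - set(fixed))
    codomain = sorted(elements(D2))
    for image in product(codomain, repeat=len(free)):
        h = {**fixed, **dict(zip(free, image))}
        if all((rel, tuple(h[a] for a in args)) in D2 for rel, args in D):
            return h
    return None

# Small example: a path a -> b -> c mapped into a directed 2-cycle x <-> y.
PATH = {("E", ("a", "b")), ("E", ("b", "c"))}
TWO_CYCLE = {("E", ("x", "y")), ("E", ("y", "x"))}
```

With these inputs, `pinned_homomorphism(PATH, [{"a"}], TWO_CYCLE, ["x"])` finds the map sending a and c to x and b to y, while pinning both a and b to x yields `None` because the target has no loop on x.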

For such a pair \((\mathcal {D},(A_{1},\dots ,A_{n}))\), with \(n \geq 0\), we define its generalized hypertreewidth in the natural way. The intuition is that we see \((\mathcal {D},(A_{1},\dots ,A_{n}))\) as a “query”, where \(A_{1} \cup \cdots \cup A_{n}\) are the “free variables” and the rest of the elements are the “existential variables”. Formally, a tree decomposition of \((\mathcal {D},(A_{1},\dots ,A_{n}))\) is a pair \((T,\chi)\), where \(T\) is a tree and \(\chi\) is a mapping that assigns a subset of the elements in \(\mathcal {D}\setminus (A_{1}\cup {\cdots } \cup A_{n})\) to each node \(t \in T\), such that the following statements hold:

  1. For each atom \(R(\bar a)\) in \(\mathcal {D}\), it is the case that \(\bar a\cap (\mathcal {D}\setminus (A_{1}\cup {\cdots } \cup A_{n}))\) is contained in \(\chi(t)\), for some \(t \in T\).

  2. For each element \(a\) in \(\mathcal {D}\setminus (A_{1}\cup {\cdots } \cup A_{n})\), the set of nodes \(t \in T\) for which \(a\) occurs in \(\chi(t)\) is connected.

The width of node \(t\) in \((T,\chi)\) is the minimal number \(\ell\) for which there are \(\ell\) atoms in \(\mathcal {D}\) covering \(\chi(t)\), i.e., atoms \(R(\bar a_{1}),\dots ,R(\bar a_{\ell })\) in \(\mathcal {D}\) such that \(\chi (t)\subseteq \bigcup _{1\leq i \leq \ell } \bar a_{i}\). The width of \((T,\chi)\) is the maximal width of the nodes of \(T\).

The generalized hypertreewidth of \((\mathcal {D},(A_{1},\dots ,A_{n}))\) is the minimum width of its tree decompositions.
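The two conditions and the width of a given decomposition can be checked directly on small inputs. The Python sketch below is illustrative and rests on assumptions not in the article: a database is a set of `(relation, tuple)` atoms, `free` is the union \(A_{1} \cup \cdots \cup A_{n}\), `chi` is a dictionary from tree nodes to sets, and `tree_edges` is assumed to describe a tree. The width is computed by exhaustive search over sets of atoms, so this is not an efficient procedure.

```python
from itertools import combinations

def connected(nodes, tree_edges):
    """True if `nodes` induces a connected subgraph of the tree (vacuously if empty)."""
    if not nodes:
        return True
    start = next(iter(nodes))
    seen, stack = {start}, [start]
    while stack:
        t = stack.pop()
        for u, w in tree_edges:
            for x, y in ((u, w), (w, u)):
                if x == t and y in nodes and y not in seen:
                    seen.add(y)
                    stack.append(y)
    return seen == nodes

def is_tree_decomposition(D, free, tree_edges, chi):
    """Check conditions 1 and 2 for (T, chi) with respect to the non-free elements."""
    for _, args in D:
        if not any(set(args) - free <= bag for bag in chi.values()):
            return False
    existential = {a for _, args in D for a in args} - free
    return all(
        connected({t for t, bag in chi.items() if a in bag}, tree_edges)
        for a in existential
    )

def node_width(D, bag):
    """Least number of atoms of D whose elements cover the bag (None if impossible)."""
    atoms = [set(args) for _, args in D]
    for size in range(len(atoms) + 1):
        for choice in combinations(atoms, size):
            if bag <= set().union(*choice):
                return size
    return None

def decomposition_width(D, chi):
    """Maximal node width over all bags of the decomposition."""
    return max(node_width(D, bag) for bag in chi.values())
```

For instance, on the path with atoms E(a,b), E(b,c), E(c,d), treating a as a free element lets the first bag shrink from {a,b,c} to {b,c}, which lowers the width of that decomposition from 2 to 1.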

By mimicking the proof of the forward implication of Proposition 3, we can show the following:

Lemma 10

Fix \(k \geq 1\). Let \(q(\bar x),q^{\prime }(\bar x^{\prime })\) be CQs, where \(\bar x=(x_{1},\dots ,x_{n})\) and \(\bar x^{\prime }=(x_{1}^{\prime },\dots ,x_{n}^{\prime })\), for \(n \geq 0\). Suppose that \((q,\bar x)\to _{k} (q^{\prime },\bar x^{\prime })\). Then, for each database \(\mathcal {D}\) and tuple \((A_{1},\dots,A_{n})\) of subsets of \(\mathcal {D}\) such that \((\mathcal {D},(A_{1},\dots ,A_{n}))\) has generalized hypertreewidth at most \(k\), it is the case that

$$ \begin{array}{@{}rcl@{}} (\mathcal{D},(A_{1},\dots,A_{n}))\to (q,(x_{1},\dots,x_{n})) \quad \Longrightarrow \\ (\mathcal{D},(A_{1},\dots,A_{n}))\to (q^{\prime},(x_{1}^{\prime},\dots,x_{n}^{\prime})). \end{array} $$

Proof

Let \(\mathcal {H}\) be a winning strategy for Duplicator witnessing the fact that \((q,\bar x)\to _{k} (q^{\prime },\bar x^{\prime })\). Let us assume that \((\mathcal {D},(A_{1},\dots ,A_{n}))\) has generalized hypertreewidth at most \(k\), and that \((\mathcal {D},(A_{1},\dots ,A_{n}))\to (q,(x_{1},\dots ,x_{n}))\) is witnessed via a homomorphism \(h\). Then we can compose \(h\) with the strategy \(\mathcal {H}\) to define a homomorphism \(g\) witnessing \((\mathcal {D},(A_{1},\dots ,A_{n}))\to (q^{\prime },(x_{1}^{\prime },\dots ,x_{n}^{\prime }))\). The mapping \(g\) is defined in a top-down fashion over a tree decomposition \((T,\chi)\) of width at most \(k\) of \((\mathcal {D},(A_{1},\dots ,A_{n}))\). One starts at the root \(r\) of \(T\), and forces Spoiler to play his pebbles over the set \(h(\chi(r))\). If Duplicator responds according to \(\mathcal {H}\) with a partial homomorphism \(f_{r}\), we then let \(g(a) = f_{r}(h(a))\), for each \(a \in \chi(r)\). We then move to each child of \(r\) and so on, until all leaves are reached and \(g\) is defined over all elements in \(\mathcal {D}\setminus (A_{1}\cup \cdots \cup A_{n})\). Since Duplicator responds to Spoiler’s moves with consistent partial homomorphisms, we have that \(g\) is actually a well-defined homomorphism from \((\mathcal {D},(A_{1},\dots ,A_{n}))\) to \((q^{\prime },(x_{1}^{\prime },\dots ,x_{n}^{\prime }))\). □

Now we are ready to show our lemma. Suppose that \((q,\bar x)\to _{k}(q^{\prime },\bar x^{\prime })\), where \(\bar x=(x_{1},\dots ,x_{n})\) and \(\bar x^{\prime }=(x_{1}^{\prime },\dots ,x_{n}^{\prime })\), for some \(n \geq 0\). Assume that \((q^{\prime \prime },\bar x^{\prime \prime })\to (q^{\prime }\wedge q, \bar z)\) via a homomorphism \(h\), for \(q^{\prime \prime }(\bar x^{\prime \prime })\in \textsf {GHW}(k)\), and suppose that \(\bar x^{\prime \prime }=(x_{1}^{\prime \prime },\dots ,x_{n}^{\prime \prime })\) and \(\bar z=(z_{1},\dots ,z_{n})\). For each \(i \in \{1,\dots,n\}\), we define \(V_{i}\) to be the set of variables \(x\) in \(q^{\prime\prime}\) such that \(h(x) = z_{i}\). In particular, \(x_{i}^{\prime \prime }\in V_{i}\), for each \(i \in \{1,\dots,n\}\). We define \(V\) to be the set of variables \(x\) in \(q^{\prime\prime}\) such that \(h(x) = y\), where \(y\) is an existentially quantified variable of \(q\). Similarly, we define \(V^{\prime}\) with respect to the existentially quantified variables of \(q^{\prime}\). Note that the sets \(V,V^{\prime},V_{1},\dots,V_{n}\) form a partition of the variables of \(q^{\prime\prime}\).

Let \(\mathcal {D}_{q^{\prime \prime }}\) be the canonical database of \(q^{\prime\prime}\). Since \(q^{\prime\prime} \in \textsf{GHW}(k)\), we know that

$$ \left( \mathcal{D}_{q^{\prime\prime}}, (\{x_{1}^{\prime\prime}\},\dots,\{x_{n}^{\prime\prime}\})\right) $$

has generalized hypertreewidth at most \(k\), as defined above. Let \(\mathcal {D}_{V}\) be the database induced in \(\mathcal {D}_{q^{\prime \prime }}\) by the set of variables \(V \cup V_{1} \cup \cdots \cup V_{n}\), i.e., the set of atoms \(R(\bar t)\in \mathcal {D}_{q^{\prime \prime }}\) such that each element in \(\bar t\) is in \(V \cup V_{1} \cup \cdots \cup V_{n}\). We now show that

$$ \left( \mathcal{D}_{V},(V_{1},\dots,V_{n})\right) $$

also has generalized hypertreewidth at most \(k\). Indeed, let \((T,\chi)\) be a tree decomposition of \((\mathcal {D}_{q^{\prime \prime }}, (\{x_{1}^{\prime \prime }\},\dots ,\{x_{n}^{\prime \prime }\}))\) of width at most \(k\). Define \(\chi^{\prime}\) such that for each \(t \in T\), we have that \(\chi^{\prime}(t) = \chi(t) \cap V\). We claim that \((T,\chi^{\prime})\) is a tree decomposition of \((\mathcal {D}_{V}, (V_{1},\dots,V_{n}))\) of width at most \(k\).

In fact, since \((T,\chi)\) is a tree decomposition, we have that, for each \(a \in V\), it is the case that the set \(\{t \in T \mid a \in \chi^{\prime}(t)\}\) is connected; and for each atom \(R(\bar a)\in \mathcal {D}_{V}\), there is a node \(t \in T\) such that \(\bar a\cap V\subseteq \chi ^{\prime }(t)\). To see that the width of \((T,\chi^{\prime})\) is bounded by \(k\), let \(t\) be a node in \(T\). Since the width of \((T,\chi)\) is at most \(k\), there are atoms \(R(\bar a_{1}),\dots ,R(\bar a_{\ell })\) in \(\mathcal {D}_{q^{\prime \prime }}\), with \(\ell \leq k\), such that \(\chi (t)\subseteq \bigcup _{1\leq i \leq \ell } \bar a_{i}\). Let \(R(\bar a_{i_{1}}),\dots ,R(\bar a_{i_{p}})\), where \(1 \leq i_{1} < \cdots < i_{p} \leq \ell\) and \(p \leq \ell\), be the atoms in \(\{R(\bar a_{1}),\dots ,R(\bar a_{\ell })\}\) that contain an element in \(\chi^{\prime}(t)\). Since \(\chi^{\prime}(t) \subseteq \chi(t)\), it is the case that \(\chi ^{\prime }(t)\subseteq \bigcup _{1\leq j \leq p} \bar a_{i_{j}}\). It suffices to show that each \(R(\bar a_{i_{j}})\) is actually an atom in \(\mathcal {D}_{V}\), for \(1 \leq j \leq p\). Towards a contradiction, suppose that this is not the case. Then, there is an atom in \(\mathcal {D}_{q^{\prime \prime }}\) that contains simultaneously one variable in \(\chi^{\prime}(t) \subseteq V\) and one variable in \(V^{\prime}\). By the definitions of \(V\) and \(V^{\prime}\), and the fact that \(h\) is a homomorphism, it follows that there is an atom in \((q^{\prime }\wedge q)(\bar z)\) that mentions simultaneously one existentially quantified variable from \(q\) and one from \(q^{\prime}\); this contradicts the definition of \((q^{\prime }\wedge q)(\bar z)\). We conclude that the generalized hypertreewidth of \((\mathcal {D}_{V},(V_{1},\dots ,V_{n}))\) is at most \(k\).

Recall that \(h\) is our initial homomorphism from \((q^{\prime \prime },\bar x^{\prime \prime })\) to \((q^{\prime }\wedge q, \bar z)\). Let \(h_{V}\) be the restriction of \(h\) to the set \(V \cup V_{1} \cup \cdots \cup V_{n}\). By construction,

$$ \left( \mathcal{D}_{V},(V_{1},\dots,V_{n})\right) \to \left( q,(x_{1},\dots,x_{n})\right) $$

via the homomorphism \(h_{V}\). We can then apply Lemma 10 and obtain that

$$ \left( \mathcal{D}_{V},(V_{1},\dots,V_{n})\right) \to \left( q^{\prime},(x_{1}^{\prime},\dots,x_{n}^{\prime})\right) $$

via a homomorphism \(h^{\prime}\). We define our required homomorphism \(g\) from \((q^{\prime \prime },\bar x^{\prime \prime })\) to \((q^{\prime },\bar x^{\prime })\) as follows: if \(a \in V \cup V_{1} \cup \cdots \cup V_{n}\), then \(g(a) = h^{\prime}(a)\); otherwise, if \(a \in V^{\prime}\), then \(g(a) = h(a)\). To see that \(g\) is a homomorphism, it suffices to consider an atom \(R(\bar a)\in \mathcal {D}_{q^{\prime \prime }}\) such that \(\bar a\) contains an element in \(V^{\prime}\) and one element not in \(V^{\prime}\), and show that \(R(g(\bar a))\in \mathcal {D}_{q^{\prime }}\). Let \(A\) be the set of elements in \(\bar a\) that are not in \(V^{\prime}\). As mentioned above, there are no atoms in \(\mathcal {D}_{q^{\prime \prime }}\) mentioning elements in \(V\) and \(V^{\prime}\) simultaneously, thus \(A \subseteq V_{1} \cup \cdots \cup V_{n}\). In particular, \(h^{\prime}(a) = h(a)\), for each \(a \in A\). It follows that \(R(g(\bar a))=R(h(\bar a))\), from which we conclude that \(R(g(\bar a))\in \mathcal {D}_{q^{\prime }}\). □

Proof

(Proposition 10) Consider the Boolean CQ q from Fig. 2, defined as

$$ q = \exists x\exists y\exists z \left( P_{a}(x,y)\wedge P_{a}(y,x) \wedge P_{a}(y,z) \wedge P_{a}(z,y) \wedge P_{b}(z,x) \wedge P_{b}(x,z)\right), $$

and the CQ \(q^{\prime}\) from the same figure defined by

$$ \begin{array}{rcl} q^{\prime} & = & \exists x\exists y_{1}\exists y_{2}\exists z \left( P_{a}(x,y_{1})\wedge P_{a}(y_{1},x) \wedge P_{a}(y_{2},z)\right. \\ & & \left.\wedge\, P_{a}(z,y_{2}) \wedge P_{b}(z,x) \wedge P_{b}(x,z)\right). \end{array} $$

For each n ≥ 1, we define the CQ

$$ \begin{array}{rcl} q_{n} & = & \exists x_{1}\cdots \exists x_{n+1} \left( P_{a}(x_{1},x_{2})\wedge \cdots \wedge P_{a}(x_{n},x_{n+1})\, \wedge \right.\\ & & \left.P_{b}(x_{1},x_{1})\wedge P_{b}(x_{n+1},x_{n+1})\right). \end{array} $$

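Observe that \(q^{\prime} \wedge q_{n} \in \textsf{GHW}(1)\), for each \(n \geq 1\). We now show that, for each \(n \geq 1\), \(q^{\prime} \wedge q_{n}\) is an incomparable \(\textsf{GHW}(1)\)-Δ-approximation of \(q\). As mentioned in Example 2, we have that \(q \to_{1} q^{\prime}\). In particular, \(q \to_{1} (q^{\prime} \wedge q_{n})\). Clearly, \(q \not\to (q^{\prime} \wedge q_{n})\). Also, \(q_{n} \not\to q\) since variables \(x_{1}\) and \(x_{n+1}\) of \(q_{n}\) cannot be mapped to any variable in \(q\) via a homomorphism. Therefore, \((q^{\prime} \wedge q_{n}) \not\to q\). By Theorem 11, it follows that \(q^{\prime} \wedge q_{n}\) is an incomparable \(\textsf{GHW}(1)\)-Δ-approximation of \(q\).

Now we show that the CQs \(\{q^{\prime} \wedge q_{n}\}_{n \geq 1}\) form a family of non-equivalent CQs. First note that \(q_{n} \not\to q^{\prime}\), for each \(n \geq 1\). Also, observe that \(q_{i} \to q_{j}\) iff \(i = j\), for \(i,j \geq 1\). It follows that for each \(i,j \geq 1\) such that \(i \neq j\), it is the case that \((q^{\prime} \wedge q_{i}) \not\to (q^{\prime} \wedge q_{j})\) and \((q^{\prime} \wedge q_{j}) \not\to (q^{\prime} \wedge q_{i})\). In particular, \(\{q^{\prime} \wedge q_{n}\}_{n \geq 1}\) is a family of non-equivalent CQs. □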
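The homomorphism facts used in this proof can be confirmed on small instances by exhaustive search. The Python sketch below is illustrative only: the encoding of a CQ as a set of `(relation, tuple)` atoms, the relation names `"Pa"` and `"Pb"`, and the function names are assumptions, and `Q` is transcribed from the displayed definition of \(q\) rather than taken from Fig. 2.

```python
from itertools import product

def q_n(n):
    """Atoms of q_n: a P_a-path x_1 -> ... -> x_{n+1} with P_b-loops on both endpoints."""
    atoms = {("Pa", (i, i + 1)) for i in range(1, n + 1)}
    atoms |= {("Pb", (1, 1)), ("Pb", (n + 1, n + 1))}
    return atoms

def maps_to(src, dst):
    """Brute-force test for a homomorphism between Boolean CQs given as atom sets."""
    src_elems = sorted({a for _, args in src for a in args})
    dst_elems = sorted({a for _, args in dst for a in args})
    for image in product(dst_elems, repeat=len(src_elems)):
        h = dict(zip(src_elems, image))
        if all((rel, tuple(h[a] for a in args)) in dst for rel, args in src):
            return True
    return False

# The Boolean CQ q as displayed at the start of this proof.
Q = {
    ("Pa", ("x", "y")), ("Pa", ("y", "x")),
    ("Pa", ("y", "z")), ("Pa", ("z", "y")),
    ("Pb", ("z", "x")), ("Pb", ("x", "z")),
}
```

On indices up to 4, `maps_to(q_n(i), q_n(j))` holds exactly when `i == j`, and `maps_to(q_n(n), Q)` fails because `Q` contains no `"Pb"` loop for the endpoints to map to.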

Proof

(Proposition 11) As already mentioned, the coNP upper bound follows directly from Theorem 11. For the lower bound, we consider the Non-Hom(\(H\)) problem, for a fixed directed graph \(H\), which asks, given a directed graph \(G\), whether \(G \not\to H\). Let us assume that, for each \(k \geq 1\), there is a directed graph \(H_{k}\) such that:

  1. \(H_{k} \in \textsf{GHW}(k)\), or more formally, the Boolean CQ \(q_{H_k}\) whose canonical database is \(H_{k}\) belongs to \(\textsf{GHW}(k)\).

  2. Non-Hom(\(H_{k}\)) is coNP-complete even when the input directed graph \(G\) satisfies \(H_{k} \not\to G\).

We later explain how to obtain these graphs \(H_{k}\). Now we reduce from the restricted version of Non-Hom(\(H_{k}\)) given by item (2) above. Let \(G\) be a directed graph such that \(H_{k} \not\to G\). We first check in polynomial time whether \(G \to_{k} H_{k}\). If \(G \not\to_{k} H_{k}\), we output a fixed pair \(q_0,q_0^{\prime }\) such that \(q_0^{\prime }\in \textsf {GHW}(k)\) and \(q_0^{\prime }\) is an incomparable \(\textsf{GHW}(k)\)-Δ-approximation of \(q_{0}\). In case \(G \to_{k} H_{k}\), we output the pair \(q_{G}, q_{H_k}\), where \(q_{G}\) and \(q_{H_k}\) are Boolean CQs whose canonical databases are precisely \(G\) and \(H_{k}\), respectively. Since \(q_{H_k}\in \textsf {GHW}(k)\) by item (1) above, the reduction is well-defined.

Suppose first that \(G \not\to H_{k}\). If \(G \not\to_{k} H_{k}\), then we are done, since \(q_0^{\prime }\) is an incomparable \(\textsf{GHW}(k)\)-Δ-approximation of \(q_{0}\). Otherwise, if \(G \to_{k} H_{k}\), since \(G \not\to H_{k}\) and \(H_{k} \not\to G\) (item (2) above), Theorem 11 implies that \(q_{H_k}\) is an incomparable \(\textsf{GHW}(k)\)-Δ-approximation of \(q_{G}\). On the other hand, assume that \(G \to H_{k}\). In particular, we have that \(G \to_{k} H_{k}\), and then, in this case, the reduction outputs the pair \(q_{G}, q_{H_k}\). Since \(G \to H_{k}\), we conclude that \(q_{H_k}\) is not an incomparable \(\textsf{GHW}(k)\)-Δ-approximation of \(q_{G}\).

It remains to define the directed graph \(H_{k}\). If \(k \geq 2\), it suffices to consider the clique on \(2k\) vertices, that is, the directed graph \(K_{2k}\) whose vertex set is \(\{1,\dots,2k\}\) and whose edges are \(\{(i,j) \mid i \neq j, \text{ for } i,j \in \{1,\dots,2k\}\}\). We have that \(K_{2k} \in \textsf{GHW}(k)\), and thus item (1) above is satisfied. Also, we can reduce from the non-\(2k\)-colorability problem by replacing each undirected edge \(\{u,v\}\) of a given undirected graph \(G\) by a directed edge in an arbitrary direction, e.g., from \(u\) to \(v\). Clearly, this is a reduction from non-\(2k\)-colorability to Non-Hom(\(K_{2k}\)). Also note that the output \(f(G)\) of the reduction satisfies \(K_{2k} \not\to f(G)\), as \(f(G)\) has neither directed loops nor directed cycles of length 2. Therefore, item (2) above is satisfied. For \(k = 1\), it is known from [30] that there is an oriented tree \(T\) (i.e., a directed graph whose underlying undirected graph is a tree and that has no directed cycles of length 1 (loops) or 2) such that Non-Hom(\(T\)) is coNP-complete. Since \(T\) is an oriented tree, it belongs to \(\textsf{GHW}(1)\), and so item (1) is satisfied. Also, by inspecting the reduction in [30], we have that item (2) also holds. □

Cite this article

Barceló, P., Romero, M. & Zeume, T. A More General Theory of Static Approximations for Conjunctive Queries. Theory Comput Syst 64, 916–964 (2020). https://doi.org/10.1007/s00224-019-09924-0

Keywords

  • Conjunctive queries
  • Hypertreewidth
  • Approximations
  • Existential pebble game