Abstract
Conjunctive query (CQ) evaluation is NP-complete, but becomes tractable for fragments of bounded hypertreewidth. Approximating a hard CQ by a query from such a fragment can thus allow for an efficient approximate evaluation. While underapproximations (i.e., approximations that return correct answers only) are well-understood, the dual notion of overapproximations (i.e., approximations that return complete – but not necessarily sound – answers), and also a more general notion of approximation based on the symmetric difference of query results, are almost unexplored. In fact, the decidability of the basic problems of evaluation, identification, and existence of those approximations has been open. This article establishes a connection between overapproximations and existential pebble games that allows for studying such problems systematically. Building on this connection, it is shown that the evaluation and identification problems for overapproximations can be solved in polynomial time. While the general existence problem remains open, the problem is shown to be decidable in 2EXPTIME over the class of acyclic CQs and in PTIME for Boolean CQs over binary schemata. Additionally, we propose a more liberal notion of overapproximations to remedy the known shortcoming that queries might not have an overapproximation, and study how queries can be overapproximated in the presence of tuple-generating and equality-generating dependencies. The techniques are then extended to symmetric difference approximations and used to provide several complexity results for the identification, existence, and evaluation problems for this type of approximation.
Notes
Recall that the symmetric difference between sets A and B is (A ∖ B) ∪ (B ∖ A).
References
Abiteboul, S., Hull, R., Vianu, V.: Foundations of Databases. Addison-Wesley, Boston (1995)
Baader, F., Calvanese, D., McGuinness, D.L., Nardi, D., Patel-Schneider, P.F. (eds.): The Description Logic Handbook: Theory, Implementation, and Applications. Cambridge University Press, Cambridge (2003)
Bárány, V., Gottlob, G., Otto, M.: Querying the guarded fragment. Logical Methods in Computer Science 10(2) (2014)
Barceló, P.: Querying graph databases. In: PODS, pp. 175–188 (2013)
Barceló, P., Gottlob, G., Pieris, A.: Semantic acyclicity under constraints. In: PODS, pp. 343–354 (2016)
Barceló, P., Libkin, L., Romero, M.: Efficient approximations of conjunctive queries. In: PODS, pp. 249–260 (2012)
Barceló, P., Libkin, L., Romero, M.: Efficient approximations of conjunctive queries. SIAM J. Comput. 43(3), 1085–1130 (2014)
Barceló, P., Romero, M., Vardi, M.Y.: Semantic acyclicity on graph databases. SIAM J. Comput. 45(4), 1339–1376 (2016)
Blumensath, A., Otto, M., Weyer, M.: Decidability results for the boundedness problem. Logical Methods in Computer Science 10(3) (2014)
Calì, A., Gottlob, G., Kifer, M.: Taming the infinite chase: Query answering under expressive relational constraints. In: KR, pp. 70–80 (2008)
Chandra, A.K., Merlin, P.M.: Optimal implementation of conjunctive queries in relational data bases. In: STOC, pp. 77–90 (1977)
Chekuri, C., Rajaraman, A.: Conjunctive query containment revisited. Theor. Comput. Sci. 239(2), 211–229 (2000)
Chen, H., Dalmau, V.: Beyond hypertree width: decomposition methods without decompositions. In: CP, pp. 167–181 (2005)
Cosmadakis, S.S., Gaifman, H., Kanellakis, P.C., Vardi, M.Y.: Decidable optimization problems for database logic programs (Preliminary Report). In: STOC, pp. 477–490 (1988)
Dalmau, V., Kolaitis, P.G., Vardi, M.Y.: Constraint satisfaction, bounded treewidth, and finite-variable logics. In: CP, pp. 310–326 (2002)
Deutsch, A., Nash, A., Remmel, J.B.: The chase revisited. In: PODS, pp. 149–158 (2008)
Fagin, R.: A normal form for relational databases that is based on domains and keys. ACM Trans. Database Syst. 6(3), 387–415 (1981)
Fagin, R., Kolaitis, P.G., Miller, R.J., Popa, L.: Data exchange: Semantics and query answering. Theor. Comput. Sci. 336(1), 89–124 (2005)
Fan, W., Li, J., Ma, S., Tang, N., Wu, Y., Wu, Y.: Graph pattern matching: From intractable to polynomial time. PVLDB 3(1), 264–275 (2010)
Fink, R., Olteanu, D.: On the optimal approximation of queries using tractable propositional languages. In: ICDT, pp. 174–185 (2011)
Fischl, W., Gottlob, G., Pichler, R.: General and fractional hypertree decompositions: hard and easy cases. In: PODS, pp. 17–32 (2018)
Gaifman, H., Mairson, H.G., Sagiv, Y., Vardi, M.Y.: Undecidable optimization problems for database logic programs. J. ACM 40(3), 683–713 (1993)
Garofalakis, M., Gibbons, P.: Approximate query processing: taming the terabytes. In: VLDB, p. 725 (2001)
Gottlob, G., Greco, G., Leone, N., Scarcello, F.: Hypertree decompositions: questions and answers. In: PODS, pp. 57–74 (2016)
Gottlob, G., Leone, N., Scarcello, F.: Hypertree decompositions and tractable queries. J. Comput. Syst. Sci. 64(3), 579–627 (2002)
Gottlob, G., Miklós, Z., Schwentick, T.: Generalized hypertree decompositions: NP-hardness and tractable variants. J. ACM 56(6), 30:1–30:32 (2009)
Greco, G., Scarcello, F.: The power of local consistency in conjunctive queries and constraint satisfaction problems. SIAM J. Comput. 46(3), 1111–1145 (2017)
Grohe, M., Marx, D.: Constraint solving via fractional edge covers. In: SODA, pp. 289–298 (2006)
Hell, P., Nešetřil, J.: The core of a graph. Discret. Math. 109(1-3), 117–126 (1992)
Hell, P., Nešetřil, J., Zhu, X.: Complexity of tree homomorphisms. Discret. Appl. Math. 70(1), 23–36 (1996)
Hell, P., Nešetřil, J.: Graphs and Homomorphisms. Oxford University Press, Oxford (2004)
Ioannidis, Y.: Approximations in database systems. In: ICDT, pp. 16–30 (2003)
Kolaitis, P.G., Panttaja, J.: On the complexity of existential pebble games. In: CSL, pp. 314–329 (2003)
Kolaitis, P.G., Vardi, M.Y.: On the expressive power of datalog: Tools and a case study. J. Comput. Syst. Sci. 51(1), 110–134 (1995)
Kolaitis, P.G., Vardi, M.Y.: Conjunctive-query containment and constraint satisfaction. J. Comput. Syst. Sci. 61(2), 302–332 (2000)
Liu, Q.: Approximate query processing. In: Encyclopedia of Database Systems, pp. 113–119 (2009)
Maier, D., Mendelzon, A.O., Sagiv, Y.: Testing implications of data dependencies. ACM Trans. Database Syst. 4(4), 455–469 (1979)
Otto, M.: The boundedness problem for monadic universal first-order logic. In: LICS, pp. 37–48 (2006)
Papadimitriou, C.H., Yannakakis, M.: On the complexity of database queries. J. Comput. Syst. Sci. 58(3), 407–427 (1999)
Yannakakis, M.: Algorithms for acyclic database schemes. In: VLDB, pp. 82–94 (1981)
Additional information
This article is part of the Topical Collection on Special Issue on Database Theory (2018)
Barceló is funded by Millennium Institute for Foundational Research on Data and Fondecyt Grant 1170109. Zeume acknowledges the financial support by the European Research Council (ERC), grant agreement No 683080. Romero and Zeume thank the Simons Institute for the Theory of Computing for hosting them. This project has received funding from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme (grant agreement No 714532). The paper reflects only the authors’ views and not the views of the ERC or the European Commission. The European Union is not liable for any use that may be made of the information contained therein.
Appendix
Proof
(Theorem 1) Fix k > 1. The CQ q is defined over graphs, i.e., over a schema with a single binary relation symbol E, and consists of k + 1 variables \(v_1,\dots,v_{k+1}\). For every 1 ≤ i < j ≤ k + 1 we add either the atom (i.e., edge) \(E(v_i,v_j)\) or \(E(v_j,v_i)\) to q in such a way that the subgraph of the resulting graph G induced by \(\{v_1,v_2,v_3\}\) is a directed cycle and a certain condition (‡), defined below, holds. We start by introducing some terminology.
Let G be a directed graph on nodes \(v_1,\dots,v_{k+1}\) that contains, for each 1 ≤ i < j ≤ k + 1, either the edge \(E(v_i,v_j)\) or \(E(v_j,v_i)\). For a set \(B \subseteq \{v_1,\dots,v_{k+1}\}\) of size 1 ≤ ℓ ≤ k − 1 and a node \(v \in \{v_1,\dots,v_{k+1}\}\setminus B\), we define conn(v,B) as the tuple \((e_1,\dots,e_{k+1}) \in \{-1,1,\#\}^{k+1}\) such that for each 1 ≤ p ≤ k + 1:
In simple terms, conn(v,B) specifies how v connects with the nodes in B.
Our condition (‡) then establishes the following:
- (‡)
For each \(B \subseteq \{v_1,\dots,v_{k+1}\}\) of size 2 ≤ ℓ ≤ k − 1 and each node v in \(\{v_1,\dots,v_{k+1}\}\setminus B\), there is a node \(v^{\prime} \in \{v_1,\dots,v_{k+1}\}\setminus B\) such that
$$ \textsf{conn}(v,B) \quad \neq \quad \textsf{conn}(v^{\prime},B). $$
That is, for each such B and v we will always be able to find another v′ outside B that connects to the nodes in B in a different way than v.
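The definition of conn and condition (‡) can be made concrete with a small brute-force check. The following is a minimal sketch, not the authors' code: the sign convention (1 for an edge from v to \(v_p\), −1 for an edge from \(v_p\) to v, # for \(v_p \notin B\)) is an assumption, as are the function names and the three example graphs, which were chosen by hand for illustration.

```python
from itertools import combinations

def conn(edges, nodes, v, B):
    """Tuple describing how v (outside B) connects to the nodes of B.

    ASSUMED convention: 1 if E(v, p) is an edge, -1 if E(p, v) is an edge,
    '#' if p is not in B. The graph is assumed to be a tournament, so for
    p in B exactly one of the two edges exists.
    """
    result = []
    for p in nodes:
        if p not in B:
            result.append('#')
        elif (v, p) in edges:
            result.append(1)
        else:
            result.append(-1)
    return tuple(result)

def satisfies_ddagger(edges, nodes):
    """Check condition (‡) for a tournament on k + 1 = len(nodes) nodes."""
    k = len(nodes) - 1
    for size in range(2, k):                     # sizes 2, ..., k - 1
        for B in combinations(nodes, size):
            B = set(B)
            outside = [v for v in nodes if v not in B]
            for v in outside:
                # (‡) fails if every node outside B connects to B exactly like v
                if all(conn(edges, nodes, w, B) == conn(edges, nodes, v, B)
                       for w in outside):
                    return False
    return True

# Hypothetical example graphs (nodes are named 1, 2, ... instead of v_1, v_2, ...).
cycle3 = {(1, 2), (2, 3), (3, 1)}                    # k = 2: no admissible B exists
cycle_plus_source = cycle3 | {(4, 1), (4, 2), (4, 3)}  # k = 3
transitive4 = {(i, j) for i in range(1, 5) for j in range(i + 1, 5)}

print(satisfies_ddagger(cycle3, [1, 2, 3]))
print(satisfies_ddagger(cycle_plus_source, [1, 2, 3, 4]))
print(satisfies_ddagger(transitive4, [1, 2, 3, 4]))
```

For k = 2 the range of admissible sizes is empty, so the check succeeds vacuously on the directed 3-cycle. The transitive tournament fails because, for B = {1, 2}, both outside nodes 3 and 4 only receive edges from B.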
Example 6
The graphs in Fig. 6 satisfy this condition for k = 2, 3, 4, respectively. Notice that the directed cycle on nodes {v1,v2,v3}, shown in the left-hand side, satisfies condition (‡) trivially.
The next lemma establishes that for each k > 1 there is always a graph that satisfies this condition.
Lemma 9
For each k > 1, there is a directed graph G on nodes \(v_1,\dots,v_{k+1}\) such that the following hold:
- 1.
For each 1 ≤ i < j ≤ k + 1, either the edge \(E(v_i,v_j)\) or \(E(v_j,v_i)\) is in G;
- 2.
the subgraph of G induced by \(\{v_1,v_2,v_3\}\) is a directed cycle; and
- 3.
G satisfies condition (‡).
Proof
(Lemma 9) For k = 2 this is given by the graph in Example 6. For k ≥ 3 we prove by induction a stronger claim: There is a directed graph G on nodes v1,…,vk+ 1 such that:
- 1.
G contains either the edge E(vi,vj) or E(vj,vi) for each 1 ≤ i < j ≤ k + 1.
- 2.
The subgraph of G induced by {v1,v2,v3} is a directed cycle.
- 3.
G contains the edges E(v1,v2) and E(v4,v3).
- 4.
G satisfies condition (‡).
The base case k = 3 is given again by the graph in Example 6. For the inductive case, assume by induction hypothesis that there is a directed graph G on nodes \(v_1,\dots,v_{k+1}\) that satisfies the claim above. A new graph G′ is then created from G by adding a new node \(v_{k+2}\) and connecting it to the nodes in \(\{v_1,\dots,v_{k+1}\}\) as follows: For each 1 ≤ i ≤ k, if \(E(v_i,v_{i+1})\) is in G then we add the edge \(E(v_{k+2},v_i)\) to G′, otherwise we add the edge \(E(v_i,v_{k+2})\). Moreover, if \(E(v_{k+1},v_1)\) is in G then we add the edge \(E(v_{k+2},v_{k+1})\) to G′, otherwise we add the edge \(E(v_{k+1},v_{k+2})\). Notice that G coincides with the subgraph of G′ that is induced by nodes \(v_1,\dots,v_{k+1}\). Moreover, by construction G′ satisfies the first three conditions of the claim. We prove next that it also satisfies condition (‡).
Take an arbitrary \(B \subseteq \{v_1,\dots,v_{k+2}\}\) of size 2 ≤ ℓ ≤ k and a node v outside B. We prove that the condition holds by cases:
\(v_{k+2} \notin B\), \(v \in \{v_1,\dots,v_{k+1}\}\), and 2 ≤ ℓ ≤ k − 1: By inductive hypothesis there is a node \(v^{\prime} \in \{v_1,\dots,v_{k+1}\}\setminus B\) such that conn(v,B) ≠ conn(v′,B).
\(v_{k+2} \notin B\), \(v \in \{v_1,\dots,v_{k+1}\}\), and ℓ = k: We set \(v^{\prime} := v_{k+2}\) and claim that the predecessor u of v in \(\{v_1,\dots,v_{k+1}\}\) distinguishes v and v′. Here, the “predecessor” of \(v_i\) is \(v_{i-1}\) if 2 ≤ i ≤ k + 1, and the predecessor of \(v_1\) is \(v_{k+1}\) (note that u ∈ B as ℓ = k). By construction of G′, we have that E(u,v) ∈ G′ if and only if E(v′,u) ∈ G′. We conclude that conn(v,B) ≠ conn(v′,B).
\(v_{k+2} \notin B\) and \(v = v_{k+2}\): There must exist some node v′ in \(\{v_1,\dots,v_{k+1}\}\) that does not belong to B but whose predecessor u in \(\{v_1,\dots,v_{k+1}\}\) does. Then by construction of G′, we have that E(u,v′) ∈ G′ if and only if E(v,u) ∈ G′. We conclude that conn(v,B) ≠ conn(v′,B).
\(v_{k+2} \in B\) and ℓ ≥ 3: Then \(B^{\prime} = B \setminus \{v_{k+2}\}\) is of size 2 ≤ ℓ − 1 ≤ k − 1. By induction hypothesis, for every node v outside B′ there is another node \(v^{\prime} \in \{v_1,\dots,v_{k+1}\}\setminus B^{\prime}\) such that conn(v,B′) ≠ conn(v′,B′). This implies that conn(v,B) ≠ conn(v′,B).
\(v_{k+2} \in B\) and ℓ = 2: Then \(B = \{v_{k+2},u\}\) for some \(u \in \{v_1,\dots,v_{k+1}\}\). Suppose first that \(u \in \{v_1,v_2,v_3\}\). Since the subgraph induced by \(\{v_1,v_2,v_3\}\) in G defines a directed cycle, it is the case that E(u,z) holds if and only if E(w,u) holds, where \(\{u,w,z\} = \{v_1,v_2,v_3\}\). Therefore, for each \(v \in \{v_1,\dots,v_{k+1}\}\setminus B\) there is a node v′ ∈ {z,w} such that conn(v,{u}) ≠ conn(v′,{u}). It follows that conn(v,B) ≠ conn(v′,B). Suppose now that \(u \notin \{v_1,v_2,v_3\}\). It suffices to exhibit two nodes v′ and v″ outside B such that \(E(v^{\prime},v_{k+2})\) and \(E(v_{k+2},v^{\prime\prime})\). By induction hypothesis the edges \(E(v_1,v_2)\) and \(E(v_4,v_3)\) are in G′. Therefore, \(v_{k+2}\) is connected via edges \(E(v_3,v_{k+2})\) and \(E(v_{k+2},v_1)\) in G′, so \(v^{\prime} = v_3\) and \(v^{\prime\prime} = v_1\) are as required.
This concludes the proof of the lemma. □
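The inductive step of the proof is constructive and can be sketched in a few lines. The code below is an illustration under stated assumptions rather than the authors' implementation: the starting graph `base` is a hand-picked tournament on four nodes that satisfies the four conditions of the stronger claim (it need not coincide with the graph of Fig. 6), nodes are named 1, 2, … instead of \(v_1, v_2, \dots\), and the helper names are hypothetical. The check compares edge directions directly, which is equivalent to comparing conn tuples in a tournament regardless of the sign convention.

```python
from itertools import combinations

def extend(edges, k):
    """Inductive step: from G on nodes 1..k+1 build G' on nodes 1..k+2."""
    new = k + 2
    result = set(edges)
    for i in range(1, k + 1):
        # E(v_i, v_{i+1}) in G  ->  add E(v_{k+2}, v_i); otherwise E(v_i, v_{k+2})
        result.add((new, i) if (i, i + 1) in edges else (i, new))
    # v_{k+1} is handled via its cyclic successor v_1
    result.add((new, k + 1) if (k + 1, 1) in edges else (k + 1, new))
    return result

def differs(edges, v, w, B):
    """True if v and w connect differently to some node of B (tournament assumed)."""
    return any(((v, b) in edges) != ((w, b) in edges) for b in B)

def satisfies_ddagger(edges, n):
    """Condition (‡) for a tournament on nodes 1..n, where n = k + 1."""
    nodes = range(1, n + 1)
    for size in range(2, n - 1):                 # sizes 2, ..., k - 1
        for B in combinations(nodes, size):
            outside = [v for v in nodes if v not in B]
            for v in outside:
                if not any(differs(edges, v, w, B) for w in outside):
                    return False
    return True

# ASSUMED base graph for k = 3: directed cycle 1 -> 2 -> 3 -> 1 plus a source 4.
base = {(1, 2), (2, 3), (3, 1), (4, 1), (4, 2), (4, 3)}
g5 = extend(base, 3)                              # tournament on 5 nodes (k = 4)

print(sorted(g5 - base))
print(satisfies_ddagger(base, 4), satisfies_ddagger(g5, 5))
```

Applying `extend` repeatedly yields larger tournaments; by the lemma these should again satisfy (‡), but only the step from k = 3 to k = 4 is exercised here.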
Fix k ≥ 1. We then take as q any Boolean CQ whose canonical database is a graph G on nodes v1,…,v2k+ 1 that satisfies the conditions stated in Lemma 9. That is, (1) for each 1 ≤ i < j ≤ 2k + 1, either the edge E(vi,vj) or E(vj,vi) is in G, (2) the subgraph of G induced by {v1,v2,v3} is a directed cycle, and (3) G satisfies condition (‡). It is easy to see that q is in GHW(k + 1) ∖GHW(k) as its underlying undirected graph is a clique on 2k + 1 elements. In fact, these elements can be covered with (k + 1) edges, but not with k.
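The covering claim can be spelled out as a short counting argument. The sketch below only uses the fact that every atom of q is binary, so a single atom covers at most two variables; the names \(u_i, w_i\) for the endpoints of k arbitrary atoms and the particular choice of covering pairs are ours.

```latex
% k binary atoms cover at most 2k of the 2k+1 variables:
\Bigl|\bigcup_{i=1}^{k}\{u_i,w_i\}\Bigr| \;\le\; 2k \;<\; 2k+1 .
% k+1 atoms suffice: pair up v_1,\dots,v_{2k} and reuse v_{2k} for the last variable.
\{v_1,v_2\}\cup\{v_3,v_4\}\cup\dots\cup\{v_{2k-1},v_{2k}\}\cup\{v_{2k},v_{2k+1}\}
  \;=\; \{v_1,\dots,v_{2k+1}\} .
```

Each pair above is joined by an atom of q, since G contains an edge between any two distinct nodes.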
We claim that q has no GHW(ℓ)-overapproximation for any 1 ≤ ℓ ≤ k. The proofs for the cases ℓ = 1 and ℓ > 1 are slightly different. We start with the latter, i.e., when 1 < ℓ ≤ k. The proof for every such ℓ is analogous, and thus we concentrate on proving the claim for ℓ = k > 1. According to Theorem 7, we need to prove that there is no constant c ≥ 0 such that for every database \(\mathcal {D}\) it holds that
It is sufficient to show then that for each integer c ≥ 0 there is a database \(\mathcal {D}\) such that
Or, equivalently, that for each integer c ≥ 0 there is a database \(\mathcal {D}\) such that
where \(q_{c}\), for c ≥ 0, is the CQ which is defined in Lemma 1, i.e., for every \(\mathcal {D}\) it is the case that \(q {\to _{k}^{c}} \mathcal {D}\) iff \(q_{c} \to \mathcal {D}\). In view of (1), this boils down to proving that
$$ q_{c+1} \not\to q_{c} \quad \text{for each } c \geq 0. \qquad (8) $$
We prove (8) by induction. The claim clearly holds for c = 0, as by definition q0 is empty while q1 is not. Let us assume now that the claim holds for c ≥ 0. That is, qc+ 1↛qc. This means, in particular, that the core of qc+ 1 is not contained in qc. That is, this core contains at least one node w in qc+ 1 that does not belong to qc.
By the way q is defined, any k-union of q must be of the form S ⊆{v1,…,v2k+ 1} with |S| = 2k. Let us consider now (Tc+ 1,βc+ 1) as defined in the proof of Lemma 1. Since w∉qc, it must be the case that there is a unique node t of Tc+ 1 such that w ∈ βc+ 1(t). Moreover, this t must be a leaf of Tc+ 1. Suppose that ϕt(w) = v, for v ∈{v1,…,v2k+ 1}, where ϕt is as defined in the proof of Lemma 1, i.e., ϕt is a bijection between βc+ 1(t) and the k-union S ⊆{v1,…,v2k+ 1} of q such that λc+ 1(t) = S.
Notice, by definition, that if the parent of t in Tc+ 1 is t′, then either λc+ 1(t′) = ∅ – which holds precisely when t′ is the root of Tc+ 1 –, or λc+ 1(t′) = S′, where S′ is the subset of {v1,…,v2k+ 1} which contains all elements save for v. That is, in the latter case we have that S′ is obtained from S by replacing some element v′ in {v1,…,v2k+ 1}, with v′≠v, by v itself.
From Proposition 1, we can assume that the homomorphism that maps qc+ 1 to its core is a retraction, i.e., it is the identity on the nodes of this core, in particular, on w. On the other hand, w is linked in qc+ 1 exclusively with the remaining nodes that appear in βc+ 1(t). Moreover, the graph induced by the nodes in λc+ 1(t) is a clique on 2k elements, and thus all the elements in βc+ 1(t) must belong to the core of qc+ 1.
Recall that \(\phi_t(w) = v\). Take an arbitrary node v″ ∈ S that is not v. Notice also that v″ ≠ v′, since v″ ∈ S while v′ ∉ S. By definition, \(T_{c+2}\) contains a leaf t″ whose parent is t such that \(\lambda_{c+2}(t^{\prime\prime}) = S^{\prime\prime}\), where S″ is the subset of \(\{v_1,\dots,v_{2k+1}\}\) which is obtained from S by replacing v″ with the unique node in \(\{v_1,\dots,v_{2k+1}\}\setminus S\), namely v′. Let us assume that \(\phi _{t^{\prime \prime }}(v^{\prime }) = w^{\prime \prime }\). Notice that w″ appears in no other node in \((T_{c+2},\beta_{c+2})\).
Assume now, for the sake of contradiction, that \(q_{c+2} \to q_{c+1}\). Then the core of \(q_{c+2}\) is the same as the core of \(q_{c+1}\). Let C be this core. Hence, from Proposition 1 there is a retraction h from \(q_{c+2}\) to C. Since all elements in \(\beta_{c+2}(t) = \beta_{c+1}(t)\) are in C, the homomorphism h must be the identity on them. But then h maps w′ to the unique element in \(q_{c+1}\) that is linked to exactly the same nodes as w′ in \(q_{c+2}\); namely, \(\phi_t(v^{\prime\prime}) = w^{\prime\prime}\).
Suppose that v′ and v″ represent the nodes \(v_i\) and \(v_j\) in \(\{v_1,\dots,v_{2k+1}\}\), respectively. By assumption, i ≠ j. But this implies then that in the canonical database G of q we have that
$$ \textsf{conn}(v_i,B) \quad = \quad \textsf{conn}(v_j,B), $$
where \(B = \{v_1,\dots,v_{2k+1}\}\setminus\{v_i,v_j\}\). This is a contradiction since B is of size 2k − 1 > 1 and G satisfies condition (‡). This concludes our proof that q has no GHW(k)-overapproximation (and, analogously, that it has no GHW(ℓ)-overapproximation for any 1 < ℓ ≤ k).
We prove next that q has no GHW(1)-overapproximation either. Let us assume, for the sake of contradiction, that q has a GHW(1)-overapproximation q′. It is an easy observation that the directed graphs in GHW(1) are precisely those whose underlying undirected graph is acyclic. Notice also that q′ has no directed cycles of length two (i.e., atoms of the form E(u,v) and E(v,u)); otherwise, since q′ → q, we would have that q also has such a cycle (which we know it does not). Using the fact that q′ ∈ GHW(1) and has no directed cycles of length two, it is not difficult to show (see e.g. [31]) that there is a sufficiently large integer n ≥ 1 such that, if \(P_n\) is the directed path on n vertices, then
This implies that if q″ is the Boolean CQ which is naturally defined by \(P_n\), then \(q^{\prime \prime } \subsetneq q^{\prime }\). Moreover, \(P_n \to G\). This is due to the fact that G contains a directed cycle on \(\{v_1,v_2,v_3\}\). We conclude that
$$ q \subseteq q^{\prime \prime } \subsetneq q^{\prime } $$
and, therefore, that q′ is not a GHW(1)-overapproximation of q. This is a contradiction. We then conclude the proof of Theorem 1. □
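The step \(P_n \to G\) relies only on the directed cycle on \(\{v_1,v_2,v_3\}\): a directed path of any length can be wrapped around it. A minimal sketch, with vertices renamed to integers (path vertices 0, …, n − 1 and cycle vertices 0, 1, 2) and with hypothetical helper names:

```python
def directed_path(n):
    """Edge set of the directed path 0 -> 1 -> ... -> n-1 on n vertices."""
    return {(i, i + 1) for i in range(n - 1)}

# Directed 3-cycle standing in for the subgraph of G induced by {v_1, v_2, v_3}.
CYCLE = {(0, 1), (1, 2), (2, 0)}

def wrap(n):
    """Candidate homomorphism: send path vertex i to cycle vertex i mod 3."""
    return {i: i % 3 for i in range(n)}

def is_hom(h, src_edges, dst_edges):
    """Check that h maps every edge of the source graph to an edge of the target."""
    return all((h[u], h[v]) in dst_edges for u, v in src_edges)

print(all(is_hom(wrap(n), directed_path(n), CYCLE) for n in range(1, 50)))
```

The map works for every n because the edge (i, i + 1) is sent to (i mod 3, (i + 1) mod 3), which is always an edge of the cycle.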
Proof
(Lemma 8) Before proving the lemma, we need some terminology and claims. Let \(\mathcal {D}\) be a database and (A1,…,An) be a tuple of pairwise-disjoint subsets of elements of \(\mathcal {D}\), where n ≥ 0. In addition, let \(\mathcal {D}^{\prime }\) be a database and (a1,…,an) a tuple of elements in \(\mathcal {D}^{\prime }\). Then we write \((\mathcal {D},(A_{1},\dots ,A_{n}))\to (\mathcal {D}^{\prime },(a_{1},\dots ,a_{n}))\) iff there is a homomorphism h from \(\mathcal {D}\) to \(\mathcal {D}^{\prime }\) such that, for each i ∈{1,…,n} and a ∈ Ai, it is the case that h(a) = ai.
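The relation \((\mathcal {D},(A_{1},\dots ,A_{n}))\to (\mathcal {D}^{\prime },(a_{1},\dots ,a_{n}))\) can be tested by exhaustive search on small instances. The following sketch assumes a representation that is ours, not the article's: a database is a set of pairs `(relation_name, tuple_of_elements)`, and only elements occurring in some atom are considered. It runs in exponential time and is meant purely as an illustration of the definition.

```python
from itertools import product

def elements(db):
    """All elements that occur in some atom of the database."""
    return sorted({x for _, args in db for x in args})

def constrained_hom(db, sets, db2, targets):
    """Search for a homomorphism h from db to db2 with h(a) = targets[i]
    for every a in sets[i]; return it as a dict, or None if none exists."""
    fixed = {a: t for A, t in zip(sets, targets) for a in A}
    free = [x for x in elements(db) if x not in fixed]
    for image in product(elements(db2), repeat=len(free)):
        h = dict(fixed)
        h.update(zip(free, image))
        # h is a homomorphism if every atom of db is mapped to an atom of db2
        if all((rel, tuple(h[x] for x in args)) in db2 for rel, args in db):
            return h
    return None

# Hypothetical example: a directed path a -> b -> c and a directed 3-cycle.
path = {("E", ("a", "b")), ("E", ("b", "c"))}
cycle = {("E", (1, 2)), ("E", (2, 3)), ("E", (3, 1))}

print(constrained_hom(path, [{"a"}], cycle, [1]))
print(constrained_hom(path, [{"a", "c"}], cycle, [1]))
```

In the first call only a is pinned, so the path wraps around the cycle. In the second call both endpoints must go to the same node, which is impossible because the cycle has no closed walk of length two.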
For such a pair \((\mathcal {D},(A_{1},\dots ,A_{n}))\), with n ≥ 0, we define its generalized hypertreewidth in the natural way. The intuition is that we see \((\mathcal {D},(A_{1},\dots ,A_{n}))\) as a “query”, where A1 ∪⋯ ∪ An are the “free variables” and the rest of the elements are the “existential variables”. Formally, a tree decomposition of \((\mathcal {D},(A_{1},\dots ,A_{n}))\) is a pair (T,χ), where T is a tree and χ is a mapping that assigns a subset of the elements in \(\mathcal {D}\setminus (A_{1}\cup {\cdots } \cup A_{n})\) to each node t ∈ T, such that the following statements hold:
- 1.
For each atom \(R(\bar a)\) in \(\mathcal {D}\), it is the case that \(\bar a\cap (\mathcal {D}\setminus (A_{1}\cup {\cdots } \cup A_{n}))\) is contained in χ(t), for some t ∈ T.
- 2.
For each element a in \(\mathcal {D}\setminus (A_{1}\cup {\cdots } \cup A_{n})\), the set of nodes t ∈ T for which a occurs in χ(t) is connected.
The width of node t in (T,χ) is the minimal number ℓ for which there are ℓ atoms in \(\mathcal {D}\) covering χ(t), i.e., atoms \(R(\bar a_{1}),\dots ,R(\bar a_{\ell })\) in \(\mathcal {D}\) such that \(\chi (t)\subseteq \bigcup _{1\leq i \leq \ell } \bar a_{i}\). The width of (T,χ) is the maximal width of the nodes of T.
The generalized hypertreewidth of \((\mathcal {D},(A_{1},\dots ,A_{n}))\) is the minimum width of its tree decompositions.
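The two conditions on (T,χ) and the notion of width can be checked mechanically on small examples. The sketch below uses a representation that is an assumption of ours: a database is a list of pairs `(relation_name, tuple_of_elements)`, `free` is the set \(A_1\cup\dots\cup A_n\), the tree is given as a list of undirected edges between node identifiers (it is not verified to be a tree), and `chi` maps each node to its bag. The width is computed by brute force over sets of atoms.

```python
from itertools import combinations

def connected(nodes, tree_edges):
    """True if `nodes` induces a connected part of the tree (empty set: True)."""
    nodes = set(nodes)
    if not nodes:
        return True
    seen, stack = set(), [next(iter(nodes))]
    while stack:
        t = stack.pop()
        if t in seen:
            continue
        seen.add(t)
        for u, w in tree_edges:
            if u == t and w in nodes:
                stack.append(w)
            elif w == t and u in nodes:
                stack.append(u)
    return seen == nodes

def is_tree_decomposition(db, free, tree_edges, chi):
    """Check conditions 1 and 2 for (T, chi) with respect to (db, free)."""
    free = set(free)
    existential = {x for _, args in db for x in args} - free
    if any(not set(bag) <= existential for bag in chi.values()):
        return False                       # bags may only contain non-free elements
    for _, args in db:                     # condition 1: atoms are covered by a bag
        if not any(set(args) - free <= set(bag) for bag in chi.values()):
            return False
    return all(connected([t for t, bag in chi.items() if a in bag], tree_edges)
               for a in existential)       # condition 2: connectedness

def bag_width(db, bag):
    """Minimal number of atoms of db whose elements cover the bag."""
    atoms = [set(args) for _, args in db]
    for size in range(len(atoms) + 1):
        if any(set(bag) <= set().union(*choice)
               for choice in combinations(atoms, size)):
            return size
    return None                            # bag contains an element outside db

def width(db, chi):
    return max(bag_width(db, bag) for bag in chi.values())

triangle = [("E", ("x", "y")), ("E", ("y", "z")), ("E", ("z", "x"))]
path = [("E", ("a", "b")), ("E", ("b", "c"))]
path_chi = {0: {"a", "b"}, 1: {"b", "c"}}

print(is_tree_decomposition(triangle, [], [], {0: {"x", "y", "z"}}),
      width(triangle, {0: {"x", "y", "z"}}))
print(is_tree_decomposition(path, [], [(0, 1)], path_chi), width(path, path_chi))
```

The triangle needs two atoms to cover its single bag, whereas every bag of the path decomposition is covered by one atom.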
By mimicking the proof of the forward implication of Proposition 3, we can show the following:
Lemma 10
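Fix k ≥ 1. Let \(q(\bar x),q^{\prime }(\bar x^{\prime })\) be CQs, where \(\bar x=(x_{1},\dots ,x_{n})\) and \(\bar x^{\prime }=(x_{1}^{\prime },\dots ,x_{n}^{\prime })\), for n ≥ 0. Suppose that \((q,\bar x)\to _{k} (q^{\prime },\bar x^{\prime })\). Then, for each database \(\mathcal {D}\) and tuple \((A_{1},\dots ,A_{n})\) of subsets of \(\mathcal {D}\) such that \((\mathcal {D},(A_{1},\dots ,A_{n}))\) has generalized hypertreewidth at most k, it is the case that
$$ (\mathcal {D},(A_{1},\dots ,A_{n}))\to (q,(x_{1},\dots ,x_{n})) \quad \Longrightarrow \quad (\mathcal {D},(A_{1},\dots ,A_{n}))\to (q^{\prime },(x_{1}^{\prime },\dots ,x_{n}^{\prime })). $$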
Proof
Let \(\mathcal {H}\) be a winning strategy for Duplicator witnessing the fact that \((q,\bar x)\to _{k} (q^{\prime },\bar x^{\prime })\). Let us assume that \((\mathcal {D},(A_{1},\dots ,A_{n}))\) has generalized hypertreewidth at most k, and that \((\mathcal {D},(A_{1},\dots ,A_{n}))\to (q,(x_{1},\dots ,x_{n}))\) is witnessed via a homomorphism h. Then we can compose h with the strategy \(\mathcal {H}\) to define a homomorphism g witnessing \((\mathcal {D},(A_{1},\dots ,A_{n}))\to (q^{\prime },(x_{1}^{\prime },\dots ,x_{n}^{\prime }))\). The mapping g is defined in a top-down fashion over the tree decomposition (T,χ) of width at most k of \((\mathcal {D},(A_{1},\dots ,A_{n}))\). One starts at the root r of T, and forces Spoiler to play his pebbles over the set h(χ(r)). If Duplicator responds according to \(\mathcal {H}\) with a partial homomorphism fr, we then let g(a) = fr(h(a)), for each a ∈ χ(r). We then move to each child of r and so on, until all leaves are reached and g is defined over all elements in \(\mathcal {D}\setminus (A_{1}\cup \cdots \cup A_{n})\). Since Duplicator responds to Spoiler’s moves with consistent partial homomorphisms, we have that g is actually a well-defined homomorphism from \((\mathcal {D},(A_{1},\dots ,A_{n}))\) to \((q^{\prime },(x_{1}^{\prime },\dots ,x_{n}^{\prime }))\). □
Now we are ready to show our lemma. Suppose that \((q,\bar x)\to _{k}(q^{\prime },\bar x^{\prime })\), where \(\bar x=(x_{1},\dots ,x_{n})\) and \(\bar x^{\prime }=(x_{1}^{\prime },\dots ,x_{n}^{\prime })\), for some n ≥ 0. Assume that \((q^{\prime \prime },\bar x^{\prime \prime })\to (q^{\prime }\wedge q, \bar z)\) via a homomorphism h, for \(q^{\prime \prime }(\bar x^{\prime \prime })\in \textsf {GHW}(k)\), and suppose that \(\bar x^{\prime \prime }=(x_{1}^{\prime \prime },\dots ,x_{n}^{\prime \prime })\) and \(\bar z=(z_{1},\dots ,z_{n})\). For each i ∈{1,…,n}, we define Vi to be the set of variables x in q″ such that h(x) = zi. In particular, \(x_{i}^{\prime \prime }\in V_{i}\), for each i ∈{1,…,n}. We define V to be the set of variables x in q″ such that h(x) = y, where y is an existentially quantified variable of q. Similarly, we define V′ with respect to the existentially quantified variables of q′. Note that the sets V,V′,V1,…,Vn form a partition of the variables of q″.
Recall that \(\mathcal {D}_{q^{\prime \prime }}\) denotes the canonical database of q″. Since q″ ∈ GHW(k), we know that
$$ (\mathcal {D}_{q^{\prime \prime }},(\{x_{1}^{\prime \prime }\},\dots ,\{x_{n}^{\prime \prime }\})) $$
has generalized hypertreewidth at most k, as defined above. Let \(\mathcal {D}_{V}\) be the database induced in \(\mathcal {D}_{q^{\prime \prime }}\) by the set of variables \(V \cup V_{1} \cup \cdots \cup V_{n}\), i.e., the set of atoms \(R(\bar t)\in \mathcal {D}_{q^{\prime \prime }}\) such that each element in \(\bar t\) is in \(V \cup V_{1} \cup \cdots \cup V_{n}\). We now show that
$$ (\mathcal {D}_{V},(V_{1},\dots ,V_{n})) $$
also has generalized hypertreewidth at most k. Indeed, let (T,χ) be a tree decomposition of \((\mathcal {D}_{q^{\prime \prime }},(\{x_{1}^{\prime \prime }\},\dots ,\{x_{n}^{\prime \prime }\}))\) of width at most k. Define χ′ such that for each t ∈ T, we have that χ′(t) = χ(t) ∩ V. We claim that (T,χ′) is a tree decomposition of \((\mathcal {D}_{V},(V_{1},\dots ,V_{n}))\) of width at most k.
In fact, since (T,χ) is a tree decomposition, we have that, for each a ∈ V, it is the case that the set {t ∈ T∣a ∈ χ′(t)} is connected; and for each atom \(R(\bar a)\in \mathcal {D}_{V}\), there is a node t ∈ T such that \(\bar a\cap V\subseteq \chi ^{\prime }(t)\). To see that the width of (T,χ′) is bounded by k, let t be a node in T. Since the width of (T,χ) is at most k, there are ℓ atoms \(R(\bar a_{1}),\dots ,R(\bar a_{\ell })\) in \(\mathcal {D}_{q^{\prime \prime }}\), with ℓ ≤ k, such that \(\chi (t)\subseteq \bigcup _{1\leq i \leq \ell } \bar a_{i}\). Let \(R(\bar a_{i_{1}}),\dots ,R(\bar a_{i_{p}})\), where 1 ≤ i1 < ⋯ < ip ≤ ℓ and p ≤ ℓ, be the atoms in \(\{R(\bar a_{1}),\dots ,R(\bar a_{\ell })\}\) that contain an element in χ′(t). Since χ′(t) ⊆ χ(t), it is the case that \(\chi ^{\prime }(t)\subseteq \bigcup _{1\leq j \leq p} \bar a_{i_{j}}\). It suffices to show that each \(R(\bar a_{i_{j}})\) is actually an atom in \(\mathcal {D}_{V}\), for 1 ≤ j ≤ p. Towards a contradiction, suppose that this is not the case. Then, there is an atom in \(\mathcal {D}_{q^{\prime \prime }}\) that contains simultaneously one variable in χ′(t) ⊆ V and one variable in V′. By the definitions of V′ and V, and the fact that h is a homomorphism, it follows that there is an atom in \((q^{\prime }\wedge q)(\bar z)\) that mentions simultaneously one existentially quantified variable from q′ and one from q; this contradicts the definition of \((q^{\prime }\wedge q)(\bar z)\). We conclude that the generalized hypertreewidth of \((\mathcal {D}_{V},(V_{1},\dots ,V_{n}))\) is at most k.
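Recall that h is our initial homomorphism from \((q^{\prime \prime },\bar x^{\prime \prime })\) to \((q^{\prime }\wedge q, \bar z)\). Let \(h_{V}\) be the restriction of h to the set \(V \cup V_{1} \cup \cdots \cup V_{n}\). By construction,
$$ (\mathcal {D}_{V},(V_{1},\dots ,V_{n}))\to (q,(x_{1},\dots ,x_{n})) $$
via homomorphism \(h_{V}\). We can then apply Lemma 10 and obtain that
$$ (\mathcal {D}_{V},(V_{1},\dots ,V_{n}))\to (q^{\prime },(x_{1}^{\prime },\dots ,x_{n}^{\prime })) $$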
via a homomorphism h′. We define our required homomorphism g from \((q^{\prime \prime },\bar x^{\prime \prime })\) to \((q^{\prime },\bar x^{\prime })\) as follows: if a ∈ V ∪ V1 ∪⋯ ∪ Vn, then g(a) = h′(a); otherwise, if a ∈ V′, then g(a) = h(a). To see that g is a homomorphism, it suffices to consider an atom \(R(\bar a)\in \mathcal {D}_{q^{\prime \prime }}\) such that \(\bar a\) contains an element in V′ and one element not in V′, and show that \(R(g(\bar a))\in \mathcal {D}_{q^{\prime }}\). Let A be the set of elements in \(\bar a\) that are not in V′. As mentioned above, there are no atoms in \(\mathcal {D}_{q^{\prime \prime }}\) mentioning elements in V′ and V simultaneously, thus A ⊆ V1 ∪⋯ ∪ Vn. In particular, h(a) = h′(a), for each a ∈ A. It follows that \(R(g(\bar a))=R(h(\bar a))\), from which we conclude that \(R(g(\bar a))\in \mathcal {D}_{q^{\prime }}\). □
Proof
(Proposition 10) Consider the Boolean CQ q from Fig. 2, defined as
and the CQ q′ from the same figure defined by
For each n ≥ 1, we define the CQ
Observe that q′∧ qn ∈GHW(1), for each n ≥ 1. We now show that, for each n ≥ 1, q′∧ qn is an incomparable GHW(1)-Δ-approximation of q. As mentioned in Example 2, we have that q →1q′. In particular q →1(q′∧ qn). Clearly, q↛(q′∧ qn). Also, qn↛q since variables x1 and xn+ 1 of qn cannot be mapped to any variable in q via a homomorphism. Therefore, (q′∧ qn)↛q. By Theorem 11, it follows that q′∧ qn is an incomparable GHW(1)-Δ-approximation of q.
Now we show that the CQs {q′∧ qn}n≥ 1 form a family of non-equivalent CQs. First note that qn↛q′, for each n ≥ 1. Also, observe that qi → qj iff i = j, for i,j ≥ 1. It follows that for each i,j ≥ 1, such that i≠j, it is the case that (q′∧ qi)↛(q′∧ qj) and (q′∧ qj)↛(q′∧ qi). In particular, {q′∧ qn}n≥ 1 is a family of non-equivalent CQs. □
Proof
(Proposition 11) As already mentioned, the coNP upper bound follows directly from Theorem 11. For the lower bound, we consider the Non-Hom(H) problem, for a fixed directed graph H, which asks, given a directed graph G, whether G ↛ H. Let us assume that, for each k ≥ 1, there is a directed graph \(H_k\) such that:
- 1.
Hk ∈GHW(k), or more formally, the Boolean CQ \(q_{H_k}\) whose canonical database is Hk belongs to GHW(k).
- 2.
Non-Hom(\(H_k\)) is coNP-complete even when the input directed graph G satisfies \(H_k \not\to G\).
We explain later how to obtain the graphs \(H_k\). Now we reduce from the restricted version of Non-Hom(\(H_k\)) given by item (2) above. Let G be a directed graph such that \(H_k \not\to G\). We first check in polynomial time whether \(G \to_k H_k\). If \(G \not\to_k H_k\), we output a fixed pair \(q_0,q_0^{\prime }\) such that \(q_0^{\prime }\in \textsf {GHW}(k)\) and \(q_0^{\prime }\) is an incomparable GHW(k)-Δ-approximation of \(q_0\). In case that \(G \to_k H_k\), we output the pair \(q_{G}, q_{H_k}\), where \(q_G\) and \(q_{H_k}\) are Boolean CQs whose canonical databases are precisely G and \(H_k\), respectively. Since \(q_{H_k}\in \textsf {GHW}(k)\) by item (1) above, the reduction is well-defined.
Suppose first that G↛Hk. If G↛kHk, then we are done, since \(q_0^{\prime }\) is an incomparable GHW(k)-Δ-approximation of q0. Otherwise, if G →kHk, since G↛Hk and Hk↛G (item (2) above), Theorem 11 implies that \(q_{H_k}\) is an incomparable GHW(k)-Δ-approximation of qG. On the other hand, assume that G → Hk. In particular, we have that G →kHk, and then, in this case, the reduction outputs the pair \(q_{G}, q_{H_k}\). Since G → Hk, we conclude that \(q_{H_k}\) is not an incomparable GHW(k)-Δ-approximation of qG.
It remains to define the directed graph \(H_k\). If k ≥ 2, it suffices to consider the clique on 2k vertices, that is, the directed graph \(K_{2k}\) whose vertex set is {1,…, 2k} and whose edges are {(i,j) ∣ i ≠ j, for i,j ∈ {1,…, 2k}}. We have that \(K_{2k}\) ∈ GHW(k), and thus item (1) above is satisfied. Also, we can reduce from the non-2k-colorability problem by replacing each undirected edge {u,v} of a given undirected graph G by a directed edge in an arbitrary direction, e.g., from u to v. Clearly, this is a reduction from non-2k-colorability to Non-Hom(\(K_{2k}\)). Also note that the output f(G) of the reduction satisfies \(K_{2k} \not\to f(G)\), as f(G) has neither directed loops nor directed cycles of length 2. Therefore, item (2) above is satisfied. For k = 1, it is known from [30] that there is an oriented tree T (i.e., a directed graph whose underlying undirected graph is a tree and which has no directed cycles of length 1 (loops) or 2) such that Non-Hom(T) is coNP-complete. Since T is an oriented tree, it belongs to GHW(1), and so item (1) is satisfied. Also, by inspecting the reduction in [30], we have that item (2) also holds. □
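For k ≥ 2, the reduction from non-2k-colorability can be illustrated on tiny inputs. The sketch below is an exponential brute-force check meant only to make the construction tangible; the function names are hypothetical, and the input is assumed to list every undirected edge exactly once and to contain no loops.

```python
from itertools import combinations, product

def orient(undirected_edges):
    """f(G): replace each undirected edge {u, v} by one directed edge,
    here in the direction in which it is listed (an arbitrary choice)."""
    return {(u, v) for u, v in undirected_edges}

def clique(n):
    """K_n as a directed graph: all pairs (i, j) with i != j."""
    return {(i, j) for i in range(1, n + 1) for j in range(1, n + 1) if i != j}

def hom_exists(G, H):
    """Brute-force test for a homomorphism from directed graph G to H."""
    gv = sorted({x for e in G for x in e})
    hv = sorted({x for e in H for x in e})
    for image in product(hv, repeat=len(gv)):
        h = dict(zip(gv, image))
        if all((h[u], h[v]) in H for u, v in G):
            return True
    return False

k = 2
K5 = orient(combinations(range(5), 2))                 # K_5 is not 4-colorable
C5 = orient([(0, 1), (1, 2), (2, 3), (3, 4), (4, 0)])  # the 5-cycle is 4-colorable

print(hom_exists(K5, clique(2 * k)))   # a homomorphism to K_4 is a 4-coloring
print(hom_exists(C5, clique(2 * k)))
print(hom_exists(clique(2 * k), K5))   # f(G) has no directed 2-cycles
```

A homomorphism from f(G) to \(K_{2k}\) is exactly a proper 2k-coloring of G, which is why the orientation chosen for each edge does not matter.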
Cite this article
Barceló, P., Romero, M. & Zeume, T. A More General Theory of Static Approximations for Conjunctive Queries. Theory Comput Syst 64, 916–964 (2020). https://doi.org/10.1007/s00224-019-09924-0
DOI: https://doi.org/10.1007/s00224-019-09924-0