## Abstract

Conjunctive query (CQ) evaluation is NP-complete, but becomes tractable for fragments of bounded hypertreewidth. Approximating a hard CQ by a query from such a fragment can thus allow for an efficient approximate evaluation. While underapproximations (i.e., approximations that return correct answers only) are well-understood, the dual notion of overapproximations (i.e., approximations that return complete – but not necessarily sound – answers), and also a more general notion of approximation based on the symmetric difference of query results, are almost unexplored. In fact, the decidability of the basic problems of evaluation, identification, and existence of those approximations has been open. This article establishes a connection between overapproximations and existential pebble games that allows for studying such problems systematically. Building on this connection, it is shown that the evaluation and identification problems for overapproximations can be solved in polynomial time. While the general existence problem remains open, the problem is shown to be decidable in 2EXPTIME over the class of acyclic CQs and in PTIME for Boolean CQs over binary schemata. Additionally, we propose a more liberal notion of overapproximation to remedy the known shortcoming that queries might not have an overapproximation, and study how queries can be overapproximated in the presence of tuple generating and equality generating dependencies. The techniques are then extended to symmetric difference approximations and used to provide several complexity results for the identification, existence, and evaluation problems for this type of approximation.

## Notes

- 1. Recall that the symmetric difference between sets *A* and *B* is (*A* ∖ *B*) ∪ (*B* ∖ *A*).

## References

- 1. Abiteboul, S., Hull, R., Vianu, V.: Foundations of Databases. Addison-Wesley, Boston (1995)
- 2. Baader, F., Calvanese, D., McGuinness, D.L., Nardi, D., Patel-Schneider, P.F. (eds.): The Description Logic Handbook: Theory, Implementation, and Applications. Cambridge University Press, Cambridge (2003)
- 3. Bárány, V., Gottlob, G., Otto, M.: Querying the guarded fragment. Logical Methods in Computer Science **10**(2) (2014)
- 4. Barceló, P.: Querying graph databases. In: PODS, pp. 175–188 (2013)
- 5. Barceló, P., Gottlob, G., Pieris, A.: Semantic acyclicity under constraints. In: PODS, pp. 343–354 (2016)
- 6. Barceló, P., Libkin, L., Romero, M.: Efficient approximations of conjunctive queries. In: PODS, pp. 249–260 (2012)
- 7. Barceló, P., Libkin, L., Romero, M.: Efficient approximations of conjunctive queries. SIAM J. Comput. **43**(3), 1085–1130 (2014)
- 8. Barceló, P., Romero, M., Vardi, M.Y.: Semantic acyclicity on graph databases. SIAM J. Comput. **45**(4), 1339–1376 (2016)
- 9. Blumensath, A., Otto, M., Weyer, M.: Decidability results for the boundedness problem. Logical Methods in Computer Science **10**(3) (2014)
- 10. Calì, A., Gottlob, G., Kifer, M.: Taming the infinite chase: Query answering under expressive relational constraints. In: KR, pp. 70–80 (2008)
- 11. Chandra, A.K., Merlin, P.M.: Optimal implementation of conjunctive queries in relational data bases. In: STOC, pp. 77–90 (1977)
- 12. Chekuri, C., Rajaraman, A.: Conjunctive query containment revisited. Theor. Comput. Sci. **239**(2), 211–229 (2000)
- 13. Chen, H., Dalmau, V.: Beyond hypertree width: decomposition methods without decompositions. In: CP, pp. 167–181 (2005)
- 14. Cosmadakis, S.S., Gaifman, H., Kanellakis, P.C., Vardi, M.Y.: Decidable optimization problems for database logic programs (Preliminary Report). In: STOC, pp. 477–490 (1988)
- 15. Dalmau, V., Kolaitis, P.G., Vardi, M.Y.: Constraint satisfaction, bounded treewidth, and finite-variable logics. In: CP, pp. 310–326 (2002)
- 16. Deutsch, A., Nash, A., Remmel, J.B.: The chase revisited. In: PODS, pp. 149–158 (2008)
- 17. Fagin, R.: A normal form for relational databases that is based on domains and keys. ACM Trans. Database Syst. **6**(3), 387–415 (1981)
- 18. Fagin, R., Kolaitis, P.G., Miller, R.J., Popa, L.: Data exchange: Semantics and query answering. Theor. Comput. Sci. **336**(1), 89–124 (2005)
- 19. Fan, W., Li, J., Ma, S., Tang, N., Wu, Y., Wu, Y.: Graph pattern matching: From intractable to polynomial time. PVLDB **3**(1), 264–275 (2010)
- 20. Fink, R., Olteanu, D.: On the optimal approximation of queries using tractable propositional languages. In: ICDT, pp. 174–185 (2011)
- 21. Fischl, W., Gottlob, G., Pichler, R.: General and fractional hypertree decompositions: hard and easy cases. In: PODS, pp. 17–32 (2018)
- 22. Gaifman, H., Mairson, H.G., Sagiv, Y., Vardi, M.Y.: Undecidable optimization problems for database logic programs. J. ACM **40**(3), 683–713 (1993)
- 23. Garofalakis, M., Gibbons, P.B.: Approximate query processing: taming the terabytes. In: VLDB, p. 725 (2001)
- 24. Gottlob, G., Greco, G., Leone, N., Scarcello, F.: Hypertree decompositions: questions and answers. In: PODS, pp. 57–74 (2016)
- 25. Gottlob, G., Leone, N., Scarcello, F.: Hypertree decompositions and tractable queries. J. Comput. Syst. Sci. **64**(3), 579–627 (2002)
- 26. Gottlob, G., Miklós, Z., Schwentick, T.: Generalized hypertree decompositions: NP-hardness and tractable variants. J. ACM **56**(6), 30:1–30:32 (2009)
- 27. Greco, G., Scarcello, F.: The power of local consistency in conjunctive queries and constraint satisfaction problems. SIAM J. Comput. **46**(3), 1111–1145 (2017)
- 28. Grohe, M., Marx, D.: Constraint solving via fractional edge covers. In: SODA, pp. 289–298 (2006)
- 29. Hell, P., Nešetřil, J.: The core of a graph. Discret. Math. **109**(1-3), 117–126 (1992)
- 30. Hell, P., Nešetřil, J., Zhu, X.: Complexity of tree homomorphisms. Discret. Appl. Math. **70**(1), 23–36 (1996)
- 31. Hell, P., Nešetřil, J.: Graphs and Homomorphisms. Oxford University Press, Oxford (2004)
- 32. Ioannidis, Y.: Approximations in database systems. In: ICDT, pp. 16–30 (2003)
- 33. Kolaitis, P.G., Panttaja, J.: On the complexity of existential pebble games. In: CSL, pp. 314–329 (2003)
- 34. Kolaitis, P.G., Vardi, M.Y.: On the expressive power of datalog: Tools and a case study. J. Comput. Syst. Sci. **51**(1), 110–134 (1995)
- 35. Kolaitis, P.G., Vardi, M.Y.: Conjunctive-query containment and constraint satisfaction. J. Comput. Syst. Sci. **61**(2), 302–332 (2000)
- 36. Liu, Q.: Approximate query processing. In: Encyclopedia of Database Systems, pp. 113–119 (2009)
- 37. Maier, D., Mendelzon, A.O., Sagiv, Y.: Testing implications of data dependencies. ACM Trans. Database Syst. **4**(4), 455–469 (1979)
- 38. Otto, M.: The boundedness problem for monadic universal first-order logic. In: LICS, pp. 37–48 (2006)
- 39. Papadimitriou, C.H., Yannakakis, M.: On the complexity of database queries. J. Comput. Syst. Sci. **58**(3), 407–427 (1999)
- 40. Yannakakis, M.: Algorithms for acyclic database schemes. In: VLDB, pp. 82–94 (1981)

## Additional information

### Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This article is part of the Topical Collection on *Special Issue on Database Theory (2018)*

Barceló is funded by Millennium Institute for Foundational Research on Data and Fondecyt Grant 1170109. Zeume acknowledges the financial support by the European Research Council (ERC), grant agreement No 683080. Romero and Zeume thank the Simons Institute for the Theory of Computing for hosting them. This project has received funding from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme (grant agreement No 714532). The paper reflects only the authors’ views and not the views of the ERC or the European Commission. The European Union is not liable for any use that may be made of the information contained therein.

## Appendix

### Proof

(Theorem 1) Fix *k* > 1. The CQ q is defined over graphs, i.e., over a schema with a single binary relation symbol E, and consists of *k* + 1 variables *v*_{1},…,*v*_{k+ 1}. For every 1 ≤ *i* < *j* ≤ *k* + 1 we add either the atom (i.e., edge) *E*(*v*_{i},*v*_{j}) or *E*(*v*_{j},*v*_{i}) to q in such a way that the subgraph of G induced by {*v*_{1},*v*_{2},*v*_{3}} is a directed cycle and a certain condition (‡), defined below, holds. We start introducing some terminology.

Let G be a directed graph on nodes *v*_{1},…,*v*_{k+1} that contains, for each 1 ≤ *i* < *j* ≤ *k* + 1, either the edge *E*(*v*_{i},*v*_{j}) or *E*(*v*_{j},*v*_{i}). For a *B* ⊆ {*v*_{1},…,*v*_{k+1}} of size 1 ≤ ℓ ≤ *k* − 1 and a node *v* ∈ {*v*_{1},…,*v*_{k+1}} ∖ *B*, we define conn(*v*,*B*) as the tuple (*e*_{1},…,*e*_{k+1}) ∈ {−1, 1, *#*}^{k+1} such that for each 1 ≤ *p* ≤ *k* + 1:

In simple terms, conn(*v*,*B*) specifies how *v* connects with the nodes in *B*.

Our condition (‡) then establishes the following:

- (‡) For each *B* ⊆ {*v*_{1},…,*v*_{k+1}} of size 2 ≤ ℓ ≤ *k* − 1 and each node *v* in {*v*_{1},…,*v*_{k+1}} ∖ *B*, there is a node *v*^{′} ∈ {*v*_{1},…,*v*_{k+1}} ∖ *B* such that $$ \textsf{conn}(v,B) \quad \neq \quad \textsf{conn}(v^{\prime},B). $$ That is, for each such *B* and *v* we will always be able to find another *v*^{′} outside *B* that connects to the nodes in *B* in a different way than *v*.
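Condition (‡) can be checked mechanically on small graphs. The following is a minimal sketch, not the authors' code: the function names are hypothetical, and the sign convention (1 for an edge *E*(*v*,*v*_{p}), −1 for an edge *E*(*v*_{p},*v*)) is an assumption, since (‡) only depends on whether two conn-tuples differ.

```python
from itertools import combinations

def conn(edges, nodes, v, B):
    """Return conn(v, B) as a tuple with one entry per node of the graph."""
    entries = []
    for p in nodes:
        if p not in B:
            entries.append("#")   # v_p lies outside B
        elif (v, p) in edges:
            entries.append(1)     # assumed convention: edge E(v, v_p)
        else:
            entries.append(-1)    # assumed convention: edge E(v_p, v)
    return tuple(entries)

def satisfies_ddagger(edges, nodes):
    """Check (‡): for every B with 2 <= |B| <= k-1, each node outside B
    differs in its conn-tuple from some other node outside B."""
    k = len(nodes) - 1
    for size in range(2, k):      # sizes 2, ..., k-1
        for B in map(set, combinations(nodes, size)):
            outside = [v for v in nodes if v not in B]
            for v in outside:
                if all(conn(edges, nodes, v, B) == conn(edges, nodes, u, B)
                       for u in outside if u != v):
                    return False
    return True

# Directed 3-cycle (k = 2): no admissible B exists, so (‡) holds trivially.
cycle = {(1, 2), (2, 3), (3, 1)}
# Transitive tournament on 4 nodes (k = 3): nodes 3 and 4 both receive
# edges from 1 and 2, so B = {1, 2} violates (‡).
transitive = {(i, j) for i in range(1, 5) for j in range(i + 1, 5)}

print(satisfies_ddagger(cycle, [1, 2, 3]))
print(satisfies_ddagger(transitive, [1, 2, 3, 4]))
```

The brute-force enumeration is exponential in the number of nodes and is only meant for sanity checks on graphs of the size shown in the examples.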

### Example 6

The graphs in Fig. 6 satisfy this condition for *k* = 2, 3, 4, respectively. Notice that the directed cycle on nodes {*v*_{1},*v*_{2},*v*_{3}}, shown on the left-hand side, satisfies condition (‡) trivially.

The next lemma establishes that for each *k* > 1 there is always a graph that satisfies this condition.

### Lemma 9

*For each* *k* > 1*, there is a directed graph* G *on nodes* *v*_{1},…,*v*_{k+1} *such that the following hold:*

- 1. *For each* 1 ≤ *i* < *j* ≤ *k* + 1*, either the edge* *E*(*v*_{i},*v*_{j}) *or* *E*(*v*_{j},*v*_{i}) *is in* G*;*
- 2. *the subgraph of* G *induced by* {*v*_{1},*v*_{2},*v*_{3}} *is a directed cycle; and*
- 3. G *satisfies condition (‡).*

### Proof

(Lemma 9) For *k* = 2 this is given by the graph in Example 6. For *k* ≥ 3 we prove by induction a stronger claim: There is a directed graph G on nodes *v*_{1},…,*v*_{k+ 1} such that:

- 1. G contains either the edge *E*(*v*_{i},*v*_{j}) or *E*(*v*_{j},*v*_{i}) for each 1 ≤ *i* < *j* ≤ *k* + 1.
- 2. The subgraph of G induced by {*v*_{1},*v*_{2},*v*_{3}} is a directed cycle.
- 3. G contains the edges *E*(*v*_{1},*v*_{2}) and *E*(*v*_{4},*v*_{3}).
- 4. G satisfies condition (‡).

The basis case *k* = 3 is given again by the graph in Example 6. For the inductive case, assume by induction hypothesis that there is a directed graph G on nodes *v*_{1},…,*v*_{k+ 1} that satisfies the claim above. A new graph *G*^{′} is then created from G by adding a new node *v*_{k+ 2} and connecting it to the nodes in {*v*_{1},…,*v*_{k+ 1}} as follows: For each 1 ≤ *i* ≤ *k*, if *E*(*v*_{i},*v*_{i+ 1}) is in G then we add the edge *E*(*v*_{k+ 2},*v*_{i}) to *G*^{′}, otherwise we add the edge *E*(*v*_{i},*v*_{k+ 2}). Moreover, if *E*(*v*_{k+ 1},*v*_{1}) is in G then we add the edge *E*(*v*_{k+ 2},*v*_{k+ 1}) to *G*^{′}, otherwise we add the edge *E*(*v*_{k+ 1},*v*_{k+ 2}). Notice that G coincides with the subgraph of *G*^{′} that is induced by nodes *v*_{1},…,*v*_{k+ 1}. Moreover, by construction *G*^{′} satisfies the first three conditions of the claim. We prove next that it also satisfies condition (‡).
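The inductive step can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the authors' construction verbatim: nodes *v*_{1},…,*v*_{k+1} are represented by the integers 1,…,*k*+1, the function names are hypothetical, and the base graph used below (the directed 3-cycle plus a fourth node with edges to all three) is one choice we picked by hand that meets the four items of the claim for *k* = 3; it is not necessarily the graph of Fig. 6.

```python
def extend(edges, k):
    """Add node k+2 to a graph on nodes 1..k+1, following the inductive step."""
    new = k + 2
    result = set(edges)
    for i in range(1, k + 1):
        if (i, i + 1) in edges:
            result.add((new, i))      # E(v_i, v_{i+1}) in G  =>  E(v_{k+2}, v_i)
        else:
            result.add((i, new))      # otherwise             =>  E(v_i, v_{k+2})
    if (k + 1, 1) in edges:
        result.add((new, k + 1))      # E(v_{k+1}, v_1) in G  =>  E(v_{k+2}, v_{k+1})
    else:
        result.add((k + 1, new))
    return result

def is_tournament(edges, n):
    """Item 1 of the claim: exactly one direction for every pair of nodes."""
    return all(((i, j) in edges) != ((j, i) in edges)
               for i in range(1, n + 1) for j in range(i + 1, n + 1))

# Hypothetical base graph for k = 3: cycle 1 -> 2 -> 3 -> 1, node 4 points to all.
base = {(1, 2), (2, 3), (3, 1), (4, 1), (4, 2), (4, 3)}
bigger = extend(base, 3)
print(sorted(bigger - base))
```

On this base graph the new node *v*_{5} receives the edge *E*(*v*_{3},*v*_{5}) and sends edges to *v*_{1}, *v*_{2}, and *v*_{4}, which matches the edges *E*(*v*_{3},*v*_{k+2}) and *E*(*v*_{k+2},*v*_{1}) used in the last case of the proof.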

Take an arbitrary *B* ⊆ {*v*_{1},…,*v*_{k+2}} of size 2 ≤ ℓ ≤ *k* and a node v outside B. We prove that the condition holds by cases:

- *v*_{k+2} ∉ *B*, *v* ∈ {*v*_{1},…,*v*_{k+1}}, and 2 ≤ ℓ ≤ *k* − 1: By inductive hypothesis there is a node *v*^{′} ∈ {*v*_{1},…,*v*_{k+1}} ∖ *B* such that conn(*v*,*B*) ≠ conn(*v*^{′},*B*).
- *v*_{k+2} ∉ *B*, *v* ∈ {*v*_{1},…,*v*_{k+1}}, and ℓ = *k*: We set *v*^{′} := *v*_{k+2} and claim that the predecessor u of v in {*v*_{1},…,*v*_{k+1}} distinguishes v and *v*^{′}. Here, the “predecessor” of *v*_{i} is *v*_{i−1} if 2 ≤ *i* ≤ *k* + 1, and the predecessor of *v*_{1} is *v*_{k+1} (note that *u* ∈ *B* as ℓ = *k*). By construction of *G*^{′}, we have that *E*(*u*,*v*) ∈ *G*^{′} if and only if *E*(*v*^{′},*u*) ∈ *G*^{′}. We conclude that conn(*v*,*B*) ≠ conn(*v*^{′},*B*).
- *v*_{k+2} ∉ *B* and *v* = *v*_{k+2}: There must exist some node *v*^{′} in {*v*_{1},…,*v*_{k+1}} that does not belong to B but its predecessor u in {*v*_{1},…,*v*_{k+1}} does. Then by construction of *G*^{′}, we have that *E*(*u*,*v*^{′}) ∈ *G*^{′} if and only if *E*(*v*,*u*) ∈ *G*^{′}. We conclude that conn(*v*,*B*) ≠ conn(*v*^{′},*B*).
- *v*_{k+2} ∈ *B* and ℓ ≥ 3: Then *B*^{′} = *B* ∖ {*v*_{k+2}} is of size 2 ≤ ℓ − 1 ≤ *k* − 1. By induction hypothesis, for every node v outside *B*^{′} there is another node *v*^{′} ∈ {*v*_{1},…,*v*_{k+1}} ∖ *B*^{′} such that conn(*v*,*B*^{′}) ≠ conn(*v*^{′},*B*^{′}). This implies that conn(*v*,*B*) ≠ conn(*v*^{′},*B*).
- *v*_{k+2} ∈ *B* and ℓ = 2: Then *B* = {*v*_{k+2},*u*} for some *u* ∈ {*v*_{1},…,*v*_{k+1}}. Suppose first that *u* ∈ {*v*_{1},*v*_{2},*v*_{3}}. Since the subgraph induced by {*v*_{1},*v*_{2},*v*_{3}} in G defines a directed cycle, it is the case that *E*(*u*,*z*) holds if and only if *E*(*w*,*u*) holds, where {*u*,*w*,*z*} = {*v*_{1},*v*_{2},*v*_{3}}. Therefore, for each *v* ∈ {*v*_{1},…,*v*_{k+1}} ∖ *B* there is a node *v*^{′} ∈ {*z*,*w*} such that conn(*v*,{*u*}) ≠ conn(*v*^{′},{*u*}). It follows that conn(*v*,*B*) ≠ conn(*v*^{′},*B*). Suppose now that *u* ∉ {*v*_{1},*v*_{2},*v*_{3}}. It suffices to exhibit two nodes *v*^{′} and *v*^{″} outside B such that *E*(*v*^{′},*v*_{k+2}) and *E*(*v*_{k+2},*v*^{″}). By induction hypothesis the edges *E*(*v*_{1},*v*_{2}) and *E*(*v*_{4},*v*_{3}) are in *G*^{′}. Therefore, *v*_{k+2} is connected via edges *E*(*v*_{3},*v*_{k+2}) and *E*(*v*_{k+2},*v*_{1}) in *G*^{′}.

This concludes the proof of the lemma. □

Fix *k* ≥ 1. We then take as *q* any Boolean CQ whose canonical database is a graph *G* on nodes *v*_{1},…,*v*_{2k+ 1} that satisfies the conditions stated in Lemma 9. That is, (1) for each 1 ≤ *i* < *j* ≤ 2*k* + 1, either the edge *E*(*v*_{i},*v*_{j}) or *E*(*v*_{j},*v*_{i}) is in *G*, (2) the subgraph of *G* induced by {*v*_{1},*v*_{2},*v*_{3}} is a directed cycle, and (3) *G* satisfies condition (‡). It is easy to see that *q* is in GHW(*k* + 1) ∖GHW(*k*) as its underlying undirected graph is a clique on 2*k* + 1 elements. In fact, these elements can be covered with (*k* + 1) edges, but not with *k*.
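The covering bound can be spelled out as a short calculation. This is a sketch in our own notation: \(\rho(X)\) denotes the minimum number of atoms of *q* needed to cover a set *X* of variables, a symbol not used in the article. Every atom of *q* is binary and every pair of variables is joined by an atom, so

$$ \rho(X) = \left\lceil \frac{|X|}{2} \right\rceil, \qquad \text{and hence} \qquad \rho(\{v_{1},\dots,v_{2k+1}\}) = \left\lceil \frac{2k+1}{2} \right\rceil = k+1 > k. $$

The lower bound holds because *k* binary atoms mention at most 2*k* variables; the upper bound is obtained by pairing up 2*k* of the variables and covering the remaining one with one further atom.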

We claim that *q* has no GHW(ℓ)-overapproximation for any 1 ≤ ℓ ≤ *k*. The proofs for the cases when ℓ = 1 and ℓ > 1 are slightly different. We start with the latter, i.e., when 1 < ℓ ≤ *k*. The proof for every such ℓ is analogous, and thus we concentrate on proving the claim for ℓ = *k* > 1. According to Theorem 7, we need to prove that there is no constant *c* ≥ 0 such that for every database \(\mathcal {D}\) it holds that

It is sufficient to show then that for each integer *c* ≥ 0 there is a database \(\mathcal {D}\) such that

Or, equivalently, that for each integer *c* ≥ 0 there is a database \(\mathcal {D}\) such that

where *q*_{c}, for *c* ≥ 0, is the CQ which is defined in Lemma 1, i.e., for every \(\mathcal {D}\) it is the case that \(q {\to _{k}^{c}} \mathcal {D}\) iff \(q_{c} \to \mathcal {D}\). In view of (1), this boils down to proving that

$$ q_{c+1} \not\to q_{c} \quad \text{for each } c \geq 0. \qquad (8) $$

We prove (8) by induction. The claim clearly holds for *c* = 0, as by definition *q*_{0} is empty while *q*_{1} is not. Let us assume now that the claim holds for *c* ≥ 0. That is, *q*_{c+ 1}↛*q*_{c}. This means, in particular, that the core of *q*_{c+ 1} is not contained in *q*_{c}. That is, this core contains at least one node *w* in *q*_{c+ 1} that does not belong to *q*_{c}.

By the way *q* is defined, any *k*-union of *q* must be of the form *S* ⊆{*v*_{1},…,*v*_{2k+ 1}} with |*S*| = 2*k*. Let us consider now (*T*_{c+ 1},*β*_{c+ 1}) as defined in the proof of Lemma 1. Since *w*∉*q*_{c}, it must be the case that there is a unique node *t* of *T*_{c+ 1} such that *w* ∈ *β*_{c+ 1}(*t*). Moreover, this *t* must be a leaf of *T*_{c+ 1}. Suppose that *ϕ*_{t}(*w*) = *v*, for *v* ∈{*v*_{1},…,*v*_{2k+ 1}}, where *ϕ*_{t} is as defined in the proof of Lemma 1, i.e., *ϕ*_{t} is a bijection between *β*_{c+ 1}(*t*) and the *k*-union *S* ⊆{*v*_{1},…,*v*_{2k+ 1}} of *q* such that *λ*_{c+ 1}(*t*) = *S*.

Notice, by definition, that if the parent of *t* in *T*_{c+ 1} is *t*^{′}, then either *λ*_{c+ 1}(*t*^{′}) = *∅* – which holds precisely when *t*^{′} is the root of *T*_{c+ 1} –, or *λ*_{c+ 1}(*t*^{′}) = *S*^{′}, where *S*^{′} is the subset of {*v*_{1},…,*v*_{2k+ 1}} which contains all elements save for *v*. That is, in the latter case we have that *S*^{′} is obtained from *S* by replacing some element *v*^{′} in {*v*_{1},…,*v*_{2k+ 1}}, with *v*^{′}≠*v*, by *v* itself.

From Proposition 1, we can assume that the homomorphism that maps *q*_{c+ 1} to its core is a retraction, i.e., it is the identity on the nodes of this core, in particular, on *w*. On the other hand, *w* is linked in *q*_{c+ 1} exclusively with the remaining nodes that appear in *β*_{c+ 1}(*t*). Moreover, the graph induced by the nodes in *λ*_{c+ 1}(*t*) is a clique on 2*k* elements, and thus all the elements in *β*_{c+ 1}(*t*) must belong to the core of *q*_{c+ 1}.

Recall that *ϕ*_{t}(*w*) = *v*. Take an arbitrary node *v*^{″} ∈ *S* that is not *v*. Notice that *v*^{″} ≠ *v*^{′}, as *v*^{″} ∈ *S* while *v*^{′} ∉ *S*. By definition, *T*_{c+2} contains a leaf *t*^{″} whose parent is *t* such that *λ*_{c+2}(*t*^{″}) = *S*^{″}, where *S*^{″} is the subset of {*v*_{1},…,*v*_{2k+1}} which is obtained from *S* by replacing *v*^{″} with the unique node in {*v*_{1},…,*v*_{2k+1}} ∖ *S*, namely *v*^{′}. Let us assume that \(\phi _{t^{\prime \prime }}(v^{\prime }) = w^{\prime \prime }\). Notice that *w*^{″} appears in no other node in (*T*_{c+2},*β*_{c+2}).

Assume now, for the sake of contradiction, that *q*_{c+2} → *q*_{c+1}. Then the core of *q*_{c+2} is the same as the core of *q*_{c+1}. Let *C* be this core. Hence, by Proposition 1 there is a retraction *h* from *q*_{c+2} to *C*. Since all elements in *β*_{c+2}(*t*) = *β*_{c+1}(*t*) are in *C*, the homomorphism *h* must be the identity on them. But then *h* maps *w*^{′} to the unique element in *q*_{c+1} that is linked to exactly the same nodes as *w*^{′} in *q*_{c+2}; namely, *ϕ*_{t}(*v*^{″}) = *w*^{″}.

Suppose that *v*^{′} and *v*^{″} represent the nodes *v*_{i} and *v*_{j} in {*v*_{1},…,*v*_{2k+1}}, respectively. By assumption, *i* ≠ *j*. But this implies then that in the canonical database *G* of *q* we have that

$$ \textsf{conn}(v_{i},B) \quad = \quad \textsf{conn}(v_{j},B), $$

where *B* = {*v*_{1},…,*v*_{2k+1}} ∖ {*v*_{i},*v*_{j}}. This is a contradiction since *B* is of size 2*k* − 1 > 1 and *G* satisfies condition (‡). This concludes our proof that *q* has no GHW(*k*)-overapproximation (and, analogously, that it has no GHW(ℓ)-overapproximation for any 1 < ℓ ≤ *k*).

We prove next that *q* has no GHW(1)-overapproximation either. Let us assume, for the sake of contradiction, that *q* has a GHW(1)-overapproximation *q*^{′}. It is an easy observation that the directed graphs in GHW(1) are precisely those whose underlying undirected graph is acyclic. Notice also that *q*^{′} has no directed cycles of length two (i.e., atoms of the form *E*(*u*,*v*) and *E*(*v*,*u*)); otherwise, since *q*^{′} → *q*, we would have that *q* also has such a cycle (which we know it does not). Using the fact that *q*^{′} ∈ GHW(1) and has no directed cycles of length two, it is not difficult to show (see e.g. [31]) that there is a sufficiently large integer *n* ≥ 1 such that, if **P**_{n} is the directed path on *n* vertices, then

This implies that if *q*^{″} is the Boolean CQ which is naturally defined by **P**_{n}, then \(q^{\prime \prime } \subsetneq q^{\prime }\). Moreover, **P**_{n} → *G*. This is due to the fact that *G* contains a directed cycle on {*v*_{1},*v*_{2},*v*_{3}}. We conclude that

$$ q \subseteq q^{\prime\prime} \subsetneq q^{\prime}, $$

and, therefore, that *q*^{′} is not a GHW(1)-overapproximation of *q*. This is a contradiction. We then conclude the proof of Theorem 1. □

### Proof

(Lemma 8) Before proving the lemma, we need some terminology and claims. Let \(\mathcal {D}\) be a database and (*A*_{1},…,*A*_{n}) be a tuple of pairwise-disjoint subsets of elements of \(\mathcal {D}\), where *n* ≥ 0. In addition, let \(\mathcal {D}^{\prime }\) be a database and (*a*_{1},…,*a*_{n}) a tuple of elements in \(\mathcal {D}^{\prime }\). Then we write \((\mathcal {D},(A_{1},\dots ,A_{n}))\to (\mathcal {D}^{\prime },(a_{1},\dots ,a_{n}))\) iff there is a homomorphism h from \(\mathcal {D}\) to \(\mathcal {D}^{\prime }\) such that, for each *i* ∈{1,…,*n*} and *a* ∈ *A*_{i}, it is the case that *h*(*a*) = *a*_{i}.
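The relation \((\mathcal {D},(A_{1},\dots ,A_{n}))\to (\mathcal {D}^{\prime },(a_{1},\dots ,a_{n}))\) can be tested by brute force on small instances. The sketch below is only an illustration of the definition; the representation of a database as a set of `(relation, tuple)` pairs and the name `find_hom` are our assumptions.

```python
from itertools import product

def find_hom(D, groups, D2, targets):
    """Return a homomorphism h from D to D2 with h(a) = targets[i] for every
    a in groups[i], or None if no such homomorphism exists."""
    # Elements of A_i are forced to map to a_i.
    fixed = {a: b for A, b in zip(groups, targets) for a in A}
    free = sorted({x for _, args in D for x in args} - set(fixed))
    codomain = sorted({x for _, args in D2 for x in args})
    # Try every assignment of the remaining elements.
    for image in product(codomain, repeat=len(free)):
        h = dict(fixed)
        h.update(zip(free, image))
        if all((rel, tuple(h[x] for x in args)) in D2 for rel, args in D):
            return h
    return None

# A directed path a -> b -> c mapped into a directed 2-cycle on {1, 2}.
path = {("E", ("a", "b")), ("E", ("b", "c"))}
two_cycle = {("E", (1, 2)), ("E", (2, 1))}

print(find_hom(path, [{"a", "c"}], two_cycle, (1,)))   # both endpoints sent to 1
print(find_hom(path, [{"a", "b"}], two_cycle, (1,)))   # would need a loop on 1
```

The search is exponential in the number of unconstrained elements, which is consistent with the NP-hardness of homomorphism testing in general.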

For such a pair \((\mathcal {D},(A_{1},\dots ,A_{n}))\), with *n* ≥ 0, we define its generalized hypertreewidth in the natural way. The intuition is that we see \((\mathcal {D},(A_{1},\dots ,A_{n}))\) as a “query”, where *A*_{1} ∪⋯ ∪ *A*_{n} are the “free variables” and the rest of the elements are the “existential variables”. Formally, a tree decomposition of \((\mathcal {D},(A_{1},\dots ,A_{n}))\) is a pair (*T*,*χ*), where T is a tree and *χ* is a mapping that assigns a subset of the elements in \(\mathcal {D}\setminus (A_{1}\cup {\cdots } \cup A_{n})\) to each node *t* ∈ *T*, such that the following statements hold:

- 1. For each atom \(R(\bar a)\) in \(\mathcal {D}\), it is the case that \(\bar a\cap (\mathcal {D}\setminus (A_{1}\cup {\cdots } \cup A_{n}))\) is contained in *χ*(*t*), for some *t* ∈ *T*.
- 2. For each element a in \(\mathcal {D}\setminus (A_{1}\cup {\cdots } \cup A_{n})\), the set of nodes *t* ∈ *T* for which a occurs in *χ*(*t*) is connected.

The width of node t in (*T*,*χ*) is the minimal number *ℓ* for which there are *ℓ* atoms in \(\mathcal {D}\) covering *χ*(*t*), i.e., atoms \(R(\bar a_{1}),\dots ,R(\bar a_{\ell })\) in \(\mathcal {D}\) such that \(\chi (t)\subseteq \bigcup _{1\leq i \leq \ell } \bar a_{i}\). The width of (*T*,*χ*) is the maximal width of the nodes of T.

The generalized hypertreewidth of \((\mathcal {D},(A_{1},\dots ,A_{n}))\) is the minimum width of its tree decompositions.
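The two conditions and the width of a node can be checked directly from these definitions. The following is a minimal brute-force sketch under our own assumptions: atoms are given as tuples of elements (relation names are irrelevant here), `free` stands for *A*_{1} ∪⋯∪ *A*_{n}, the tree is a list of undirected edges, and all function names are hypothetical.

```python
from itertools import combinations

def is_tree_decomposition(atoms, free, tree_edges, chi):
    """Check conditions 1 and 2 for the bag assignment chi (node -> set)."""
    # Condition 1: the non-free part of every atom fits into some bag.
    for atom in atoms:
        if not any((set(atom) - free) <= bag for bag in chi.values()):
            return False
    # Condition 2: the nodes whose bag contains x form a connected subtree.
    for x in {y for atom in atoms for y in atom} - free:
        nodes = {t for t, bag in chi.items() if x in bag}
        start = next(iter(nodes))
        seen, stack = {start}, [start]
        while stack:
            t = stack.pop()
            for a, b in tree_edges:
                for s, u in ((a, b), (b, a)):
                    if s == t and u in nodes and u not in seen:
                        seen.add(u)
                        stack.append(u)
        if seen != nodes:
            return False
    return True

def node_width(atoms, bag):
    """Smallest number of atoms whose elements cover the bag."""
    for size in range(len(atoms) + 1):
        for chosen in combinations(atoms, size):
            if bag <= {x for atom in chosen for x in atom}:
                return size
    return None  # the bag contains an element that occurs in no atom

# Path a - b - c - d with a as the only "free variable".
atoms = [("a", "b"), ("b", "c"), ("c", "d")]
free = {"a"}
tree_edges = [("t1", "t2")]
chi = {"t1": {"b", "c"}, "t2": {"c", "d"}}

print(is_tree_decomposition(atoms, free, tree_edges, chi))
print(max(node_width(atoms, bag) for bag in chi.values()))
```

Computing the exact width of a bag is a set-cover problem, so the enumeration above is only suitable for small examples.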

By mimicking the proof of the forward implication of Proposition 3, we can show the following:

### Lemma 10

*Fix* *k* ≥ 1*. Let* \(q(\bar x),q^{\prime }(\bar x^{\prime })\) *be CQs, where* \(\bar x=(x_{1},\dots ,x_{n})\) *and* \(\bar x^{\prime }=(x_{1}^{\prime },\dots ,x_{n}^{\prime })\)*, for* *n* ≥ 0*. Suppose that* \((q,\bar x)\to _{k} (q^{\prime },\bar x^{\prime })\)*. Then, for each database* \(\mathcal {D}\) *and tuple* (*A*_{1},…,*A*_{n}) *of subsets of* \(\mathcal {D}\) *such that* \((\mathcal {D},(A_{1},\dots ,A_{n}))\) *has generalized hypertreewidth at most* *k**, it is the case that*

$$ (\mathcal{D},(A_{1},\dots,A_{n})) \to (q,\bar x) \quad \text{implies} \quad (\mathcal{D},(A_{1},\dots,A_{n})) \to (q^{\prime},\bar x^{\prime}). $$
### Proof

Let \(\mathcal {H}\) be a winning strategy for Duplicator witnessing the fact that \((q,\bar x)\to _{k} (q^{\prime },\bar x^{\prime })\). Let us assume that \((\mathcal {D},(A_{1},\dots ,A_{n}))\) has generalized hypertreewidth at most k, and that \((\mathcal {D},(A_{1},\dots ,A_{n}))\to (q,(x_{1},\dots ,x_{n}))\) is witnessed via a homomorphism h. Then we can compose h with the strategy \(\mathcal {H}\) to define a homomorphism g witnessing \((\mathcal {D},(A_{1},\dots ,A_{n}))\to (q^{\prime },(x_{1}^{\prime },\dots ,x_{n}^{\prime }))\). The mapping g is defined in a top-down fashion over the tree decomposition (*T*,*χ*) of width at most k of \((\mathcal {D},(A_{1},\dots ,A_{n}))\). One starts at the root r of T, and forces Spoiler to play his pebbles over the set *h*(*χ*(*r*)). If Duplicator responds according to \(\mathcal {H}\) with a partial homomorphism *f*_{r}, we then let *g*(*a*) = *f*_{r}(*h*(*a*)), for each *a* ∈ *χ*(*r*). We then move to each child of r and so on, until all leaves are reached and g is defined over all elements in \(\mathcal {D}\setminus (A_{1}\cup \cdots \cup A_{n})\). Since Duplicator responds to Spoiler’s moves with consistent partial homomorphisms, we have that g is actually a well-defined homomorphism from \((\mathcal {D},(A_{1},\dots ,A_{n}))\) to \((q^{\prime },(x_{1}^{\prime },\dots ,x_{n}^{\prime }))\). □

Now we are ready to show our lemma. Suppose that \((q,\bar x)\to _{k}(q^{\prime },\bar x^{\prime })\), where \(\bar x=(x_{1},\dots ,x_{n})\) and \(\bar x^{\prime }=(x_{1}^{\prime },\dots ,x_{n}^{\prime })\), for some *n* ≥ 0. Assume that \((q^{\prime \prime },\bar x^{\prime \prime })\to (q^{\prime }\wedge q, \bar z)\) via a homomorphism *h*, for \(q^{\prime \prime }(\bar x^{\prime \prime })\in \textsf {GHW}(k)\), and suppose that \(\bar x^{\prime \prime }=(x_{1}^{\prime \prime },\dots ,x_{n}^{\prime \prime })\) and \(\bar z=(z_{1},\dots ,z_{n})\). For each *i* ∈{1,…,*n*}, we define *V*_{i} to be the set of variables *x* in *q*^{″} such that *h*(*x*) = *z*_{i}. In particular, \(x_{i}^{\prime \prime }\in V_{i}\), for each *i* ∈{1,…,*n*}. We define *V* to be the set of variables *x* in *q*^{″} such that *h*(*x*) = *y*, where *y* is an existentially quantified variable of *q*. Similarly, we define *V*^{′} with respect to the existentially quantified variables of *q*^{′}. Note that the sets *V*,*V*^{′},*V*_{1},…,*V*_{n} form a partition of the variables of *q*^{″}.

Recall that \(\mathcal {D}_{q^{\prime \prime }}\) is the canonical database of *q*^{″}. Since *q*^{″} ∈ GHW(*k*), we know that

$$ (\mathcal{D}_{q^{\prime\prime}},(\{x_{1}^{\prime\prime}\},\dots,\{x_{n}^{\prime\prime}\})) $$

has generalized hypertreewidth at most *k*, as defined above. Let \(\mathcal {D}_{V}\) be the database induced in \(\mathcal {D}_{q^{\prime \prime }}\) by the set of variables *V* ∪ *V*_{1} ∪⋯ ∪ *V*_{n}, i.e., the set of atoms \(R(\bar t)\in \mathcal {D}_{q^{\prime \prime }}\) such that each element in \(\bar t\) is in *V* ∪ *V*_{1} ∪⋯ ∪ *V*_{n}. We now show that

$$ (\mathcal{D}_{V},(V_{1},\dots,V_{n})) $$

also has generalized hypertreewidth at most *k*. Indeed, let (*T*,*χ*) be a tree decomposition of \((\mathcal {D}_{q^{\prime \prime }},(\{x_{1}^{\prime \prime }\},\dots ,\{x_{n}^{\prime \prime }\}))\) of width at most *k*. Define *χ*^{′} such that for each *t* ∈ *T*, we have that *χ*^{′}(*t*) = *χ*(*t*) ∩ *V*. We claim that (*T*,*χ*^{′}) is a tree decomposition of \((\mathcal {D}_{V},(V_{1},\dots ,V_{n}))\) of width at most *k*.

In fact, since (*T*,*χ*) is a tree decomposition, we have that, for each *a* ∈ *V*, it is the case that the set {*t* ∈ *T*∣*a* ∈ *χ*^{′}(*t*)} is connected; and for each atom \(R(\bar a)\in \mathcal {D}_{V}\), there is a node *t* ∈ *T* such that \(\bar a\cap V\subseteq \chi ^{\prime }(t)\). To see that the width of (*T*,*χ*^{′}) is bounded by *k*, let *t* be a node in *T*. Since the width of (*T*,*χ*) is at most *k*, there are *ℓ* atoms \(R(\bar a_{1}),\dots ,R(\bar a_{\ell })\) in \(\mathcal {D}_{q^{\prime \prime }}\), with *ℓ* ≤ *k*, such that \(\chi (t)\subseteq \bigcup _{1\leq i \leq \ell } \bar a_{i}\). Let \(R(\bar a_{i_{1}}),\dots ,R(\bar a_{i_{p}})\), where 1 ≤ *i*_{1} < ⋯ < *i*_{p} ≤ *ℓ* and *p* ≤ *ℓ*, be the atoms in \(\{R(\bar a_{1}),\dots ,R(\bar a_{\ell })\}\) that contain an element in *χ*^{′}(*t*). Since *χ*^{′}(*t*) ⊆ *χ*(*t*), it is the case that \(\chi ^{\prime }(t)\subseteq \bigcup _{1\leq j \leq p} \bar a_{i_{j}}\). It suffices to show that each \(R(\bar a_{i_{j}})\) is actually an atom in \(\mathcal {D}_{V}\), for 1 ≤ *j* ≤ *p*. Towards a contradiction, suppose that this is not the case. Then, there is an atom in \(\mathcal {D}_{q^{\prime \prime }}\) that contains simultaneously one variable in *χ*^{′}(*t*) ⊆ *V* and one variable in *V*^{′}. By the definitions of *V*^{′} and *V*, and the fact that *h* is a homomorphism, it follows that there is an atom in \((q^{\prime }\wedge q)(\bar z)\) that mentions simultaneously one existentially quantified variable from *q*^{′} and one from *q*; this contradicts the definition of \((q^{\prime }\wedge q)(\bar z)\). We conclude that the generalized hypertreewidth of \((\mathcal {D}_{V},(V_{1},\dots ,V_{n}))\) is at most *k*.

Recall that *h* is our initial homomorphism from \((q^{\prime \prime },\bar x^{\prime \prime })\) to \((q^{\prime }\wedge q, \bar z)\). Let *h*_{V} be the restriction of *h* to the set *V* ∪ *V*_{1} ∪⋯ ∪ *V*_{n}. By construction,

$$ (\mathcal{D}_{V},(V_{1},\dots,V_{n})) \to (q,\bar x) $$

via homomorphism *h*_{V}. We can then apply Lemma 10 and obtain that

$$ (\mathcal{D}_{V},(V_{1},\dots,V_{n})) \to (q^{\prime},\bar x^{\prime}) $$
via a homomorphism *h*^{′}. We define our required homomorphism *g* from \((q^{\prime \prime },\bar x^{\prime \prime })\) to \((q^{\prime },\bar x^{\prime })\) as follows: if *a* ∈ *V* ∪ *V*_{1} ∪⋯ ∪ *V*_{n}, then *g*(*a*) = *h*^{′}(*a*); otherwise, if *a* ∈ *V*^{′}, then *g*(*a*) = *h*(*a*). To see that *g* is a homomorphism, it suffices to consider an atom \(R(\bar a)\in \mathcal {D}_{q^{\prime \prime }}\) such that \(\bar a\) contains an element in *V*^{′} and one element not in *V*^{′}, and show that \(R(g(\bar a))\in \mathcal {D}_{q^{\prime }}\). Let *A* be the set of elements in \(\bar a\) that are not in *V*^{′}. As mentioned above, there are no atoms in \(\mathcal {D}_{q^{\prime \prime }}\) mentioning elements in *V*^{′} and *V* simultaneously, thus *A* ⊆ *V*_{1} ∪⋯ ∪ *V*_{n}. In particular, *h*(*a*) = *h*^{′}(*a*), for each *a* ∈ *A*. It follows that \(R(g(\bar a))=R(h(\bar a))\), from which we conclude that \(R(g(\bar a))\in \mathcal {D}_{q^{\prime }}\). □

### Proof

(Proposition 10) Consider the Boolean CQ q from Fig. 2, defined as

and the CQ *q*^{′} from the same figure defined by

For each *n* ≥ 1, we define the CQ

Observe that *q*^{′}∧ *q*_{n} ∈GHW(1), for each *n* ≥ 1. We now show that, for each *n* ≥ 1, *q*^{′}∧ *q*_{n} is an incomparable GHW(1)-Δ-approximation of *q*. As mentioned in Example 2, we have that *q* →_{1}*q*^{′}. In particular *q* →_{1}(*q*^{′}∧ *q*_{n}). Clearly, *q*↛(*q*^{′}∧ *q*_{n}). Also, *q*_{n}↛*q* since variables *x*_{1} and *x*_{n+ 1} of *q*_{n} cannot be mapped to any variable in *q* via a homomorphism. Therefore, (*q*^{′}∧ *q*_{n})↛*q*. By Theorem 11, it follows that *q*^{′}∧ *q*_{n} is an incomparable GHW(1)-Δ-approximation of *q*.

Now we show that the CQs {*q*^{′}∧ *q*_{n}}_{n≥ 1} form a family of non-equivalent CQs. First note that *q*_{n}↛*q*^{′}, for each *n* ≥ 1. Also, observe that *q*_{i} → *q*_{j} iff *i* = *j*, for *i*,*j* ≥ 1. It follows that for each *i*,*j* ≥ 1, such that *i*≠*j*, it is the case that (*q*^{′}∧ *q*_{i})↛(*q*^{′}∧ *q*_{j}) and (*q*^{′}∧ *q*_{j})↛(*q*^{′}∧ *q*_{i}). In particular, {*q*^{′}∧ *q*_{n}}_{n≥ 1} is a family of non-equivalent CQs. □

### Proof

(Proposition 11) As already mentioned, the coNP upper bound follows directly from Theorem 11. For the lower bound, we consider the Non-Hom(*H*) problem, for a fixed directed graph H, which asks, given a directed graph G, whether *G* ↛ *H*. Let us assume that, for each *k* ≥ 1, there is a directed graph *H*_{k} such that:

- 1. *H*_{k} ∈ GHW(*k*), or more formally, the Boolean CQ \(q_{H_k}\) whose canonical database is *H*_{k} belongs to GHW(*k*).
- 2. Non-Hom(*H*_{k}) is coNP-complete even when the input directed graph G satisfies that *H*_{k} ↛ *G*.

We later explain how to obtain these graphs *H*_{k}. Now we reduce from the restricted version of Non-Hom(*H*_{k}) given by item (2) above. Let G be a directed graph such that *H*_{k} ↛ *G*. We first check in polynomial time whether *G* →_{k} *H*_{k}. If *G* ↛_{k} *H*_{k}, we output a fixed pair \(q_0,q_0^{\prime }\) such that \(q_0^{\prime }\in \textsf {GHW}(k)\) and \(q_0^{\prime }\) is an incomparable GHW(*k*)-Δ-approximation of *q*_{0}. In case that *G* →_{k} *H*_{k}, we output the pair \(q_{G}, q_{H_k}\), where *q*_{G} and \(q_{H_k}\) are Boolean CQs whose canonical databases are precisely G and *H*_{k}, respectively. Since \(q_{H_k}\in \textsf {GHW}(k)\) by item (1) above, the reduction is well-defined.

Suppose first that *G*↛*H*_{k}. If *G*↛_{k}*H*_{k}, then we are done, since \(q_0^{\prime }\) is an incomparable GHW(*k*)-Δ-approximation of *q*_{0}. Otherwise, if *G* →_{k}*H*_{k}, since *G*↛*H*_{k} and *H*_{k}↛*G* (item (2) above), Theorem 11 implies that \(q_{H_k}\) is an incomparable GHW(*k*)-Δ-approximation of *q*_{G}. On the other hand, assume that *G* → *H*_{k}. In particular, we have that *G* →_{k}*H*_{k}, and then, in this case, the reduction outputs the pair \(q_{G}, q_{H_k}\). Since *G* → *H*_{k}, we conclude that \(q_{H_k}\) is not an incomparable GHW(*k*)-Δ-approximation of *q*_{G}.

It remains to define the directed graph *H*_{k}. If *k* ≥ 2, it suffices to consider the clique on 2*k* vertices, that is, the directed graph **K**_{2k} whose vertex set is {1,…, 2*k*} and whose edges are {(*i*,*j*) ∣ *i* ≠ *j*, for *i*,*j* ∈ {1,…, 2*k*}}. We have that **K**_{2k} ∈ GHW(*k*), and thus item (1) above is satisfied. Also, we can reduce from the non-2*k*-colorability problem by replacing each undirected edge {*u*,*v*} of a given undirected graph G by a directed edge in an arbitrary direction, e.g., from u to v. Clearly, this is a reduction from non-2*k*-colorability to Non-Hom(**K**_{2k}). Also note that the output *f*(*G*) of the reduction satisfies that **K**_{2k} ↛ *f*(*G*), as *f*(*G*) has neither directed loops nor directed cycles of length 2. Therefore, item (2) above is satisfied. For *k* = 1, it is known from [30] that there is an oriented tree T (i.e., a directed graph whose underlying undirected graph is a tree and that has no directed cycles of length 1 (loops) or 2) such that Non-Hom(*T*) is coNP-complete. Since T is an oriented tree, it belongs to GHW(1), and thus item (1) is satisfied. Also, by inspecting the reduction in [30], we have that item (2) also holds. □
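The reduction from non-2*k*-colorability can be illustrated on small graphs. The sketch below is ours, not part of the article: the function names are hypothetical, each undirected edge is oriented from its smaller to its larger endpoint (one arbitrary choice among those the proof allows), and homomorphisms are found by exhaustive search.

```python
from itertools import combinations, product

def complete_digraph(n):
    """K_n as a directed graph: all pairs (i, j) with i != j."""
    return {(i, j) for i in range(1, n + 1) for j in range(1, n + 1) if i != j}

def orient(undirected_edges):
    """f(G): orient every edge {u, v} from the smaller to the larger endpoint."""
    return {(min(u, v), max(u, v)) for u, v in undirected_edges}

def hom_exists(G, H):
    """Brute-force test for a homomorphism between directed graphs."""
    g_nodes = sorted({x for e in G for x in e})
    h_nodes = sorted({x for e in H for x in e})
    for image in product(h_nodes, repeat=len(g_nodes)):
        m = dict(zip(g_nodes, image))
        if all((m[u], m[v]) in H for u, v in G):
            return True
    return False

k = 2
K4 = complete_digraph(2 * k)
five_clique = orient(combinations(range(1, 6), 2))             # not 4-colorable
five_cycle = orient([(1, 2), (2, 3), (3, 4), (4, 5), (5, 1)])  # 4-colorable

print(hom_exists(five_clique, K4))   # no proper 4-coloring exists
print(hom_exists(five_cycle, K4))    # a proper 4-coloring exists
print(hom_exists(K4, five_clique))   # f(G) has no 2-cycles, so K4 cannot map into it
```

A homomorphism from *f*(*G*) to **K**_{2k} is exactly a proper 2*k*-coloring of *G*, since the only requirement imposed by **K**_{2k} is that adjacent nodes receive different values.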

## About this article

### Cite this article

Barceló, P., Romero, M. & Zeume, T. A More General Theory of Static Approximations for Conjunctive Queries. *Theory Comput Syst* **64**, 916–964 (2020). https://doi.org/10.1007/s00224-019-09924-0

### Keywords

- Conjunctive queries
- Hypertreewidth
- Approximations
- Existential pebble game