Efficient Evaluation of Well-designed Pattern Trees (Extended Abstract)? Pablo Barceló1 , Reinhard Pichler2 , and Sebastian Skritek2 1 Center for Semantic Web Research & Department of Computer Science, University of Chile 2 Faculty of Informatics, Vienna University of Technology Abstract. Conjunctive queries (CQs) constitute the core of the query languages for relational databases and also the most intensively studied querying mecha- nism in the database theory community. But CQs suffer from a serious drawback when dealing with incomplete information: If it is not possible to match the com- plete query with the data, they return no answer at all. The semantic web there- fore provides a formalism - known as well-designed pattern trees (WDPTs) - that tackles this problem. In particular, WDPTs allow us to match patterns over the data if available, but do not fail to give an answer otherwise. Here, we abstract away the specifics of semantic web applications and study WDPTs over arbitrary relational schemas. Since our language properly subsumes the class of CQs, the evaluation problem associated with it is intractable. In this paper we identify nat- ural structural properties of WDPTs that lead to tractability of various variants of the evaluation problem. 1 Introduction Conjunctive queries (CQs) constitute the core of the query languages for relational databases and also the most intensively studied querying mechanism in the database theory community. But CQs suffer from a serious drawback when dealing with incom- plete information: they fail to provide an answer when the pattern described by the query cannot be matched completely into the data. The semantic web therefore provides formalisms to overcome this problem. One simple such formalism corresponds to the {AND,OPT}-fragment of SPARQL – the standard query language for RDF, the semantic web data model. The OPT-operator extends the AND-operator by the possibility to return partial answers. I.e., instead of returning no answer at all if not the complete query can be matched into the data, it al- lows to match parts of the query. Pérez et al. noticed that a non-constrained interaction of these two operators may lead to undesired behavior [11]. This motivated the defini- tion of a better behaved syntactic restriction of the language, known as well-designed {AND,OPT}-SPARQL. Queries in this fragment allow for a natural tree representa- tion, called well-designed pattern trees (WDPTs) [10]. Here, we abstract away from the specifics of RDF and define WDPTs over arbitrary relational schemas. Despite the importance of WDPTs, very little is known about some fundamental problems related to them. In particular, no in-depth study has been carried out regarding ? This is an extended abstract of [3] efficient evaluation of these queries, a problem that permeates the literature on CQs and its extensions [13, 8, 9]. The main goal of this work is to initiate a systematic study of tractable fragments of WDPTs for the different variants of query evaluation that have been studied in the literature. We explain this in more detail below. 2 Well-designed Pattern Trees We define the class of WDPTs below. Intuitively, the nodes of a WDPT represent CQs (called “basic graph patterns” in the semantic web context) while the nesting of optional matching is represented by the tree structure of a WDPT. A well-designed pattern tree (WDPT) over schema σ is a pair (T, λ, x̄), such that: 1. T is a rooted tree and λ maps each node t in T to a set of relational atoms over σ. 2. For every variable y mentioned in T , the set of nodes where y occurs is connected. 3. The tuple x̄ of distinct variables from T denotes the free variables of the WDPT. We say that (T, λ, x̄) is projection-free, if x̄ contains all variables mentioned in T . Assume p = (T, λ, x̄) is a WDPT over σ. We write r to denote the root of T . Given a subtree T 0 of T rooted in r, we define qTS 0 to be the CQ Ans(ȳ) ← R1 (v̄1 ), . . . , Rm (v̄m ), where {R1 (v̄1 ), . . . , Rm (v̄m )} = t∈T 0 λ(t), and ȳ are all the variables that are mentioned in T 0 . That is, all variables in qT 0 appear free. We define the semantics of WDPTs by naturally extending their interpretation un- der semantic web vocabularies [10]. The intuition behind the semantics of a WDPT (T, λ, x̄) is as follows. A mapping h is an answer to (T, λ) over a database D, if it is “maximal” among the mappings that satisfy the patterns qT 0 defined by the subtrees T 0 of T . This means, h is a solution to qT 0 and there is no way to “extend” h to a solution of some qT 00 for some bigger subtree T 00 of T . The evaluation of a WDPT (T, λ, x̄) over D corresponds then to the projection over the variables in x̄ of the mappings h that satisfy (T, λ) over D. We formalize this next: Let D be a database and p = (T, λ, x̄) a WDPT over schema σ. Assume that dom(D) is the set of elements in the active domain of D and X are the variables mentioned in p. Then: – A homomorphism from p to D is a partial mapping h : X → dom(D), for which it is the case that there is a subtree T 0 of T rooted in r such that h is a homomorphism from the CQ qT 0 to D. – The homomorphism h is maximal if there is no homomorphism h0 from p to D such that h0 extends h. If h is a homomorphism from p = (T, λ, x̄) to D we denote by hx̄ the restriction of h to the variables in x̄. The evaluation of p over D, denoted p(D), corresponds to all mappings of the form hx̄ , such that h is a maximal homomorphism from p to D. Notice that WDPTs properly extend CQs. In fact, assume q(x̄) is a CQ of the form Ans(x̄) ← R1 (v̄1 ), . . . , Rm (v̄m ). Then q(x̄) is equivalent to WDPT p = (T, λ, x̄), where T consists of a single node r and λ(r) = {R1 (v̄1 ), . . . , Rm (v̄m )}. In other words, q(D) = p(D), for each database D. 3 Efficient evaluation of WDPTs 3.1 Evaluation of WDPTs: We study the complexity of the evaluation problem E VAL(C) for different classes C of WDPTs. This problem is formally defined as follows: Given a database D and a WDPT p over σ, as well as a partial mapping h : X → dom(D), where X is the set of variables mentioned in p, is it the case that h belongs to p(D)? The complexity of E VAL(C) has been studied for the case when C is the class Call of all WDPTs or the class Cpf of projection-free WDPTs. In particular, E VAL(Call ) is Σ2P -complete [10] and E VAL(Cpf ) is CO NP-complete [11]. This raises the need for understanding which classes of WDPTs can be evaluated in polynomial time. Evaluation of WDPTs is defined in terms of CQ evaluation, which is an intractable problem in general. Therefore, our goal of identifying tractable classes of WDPTs nat- urally calls for a restriction of the classes of CQ patterns allowed in them. In particular, there has been a flurry of activity around the topic of determining which classes of CQs admit efficient evaluation that could be reused in our scenario [13, 8, 9]. These include classes of bounded treewidth [6], hypertreewidth [9], etc. We concentrate on the first two. From now on, we denote by TW(k) (resp., HW(k)), for k ≥ 1, the class of CQs of treewidth (resp., hypertreewidth) at most k. A condition that has been shown to help identifying relevant tractable fragments of WDPTs is local tractability [10]. This refers to restricting the CQ defined by each node in a WDPT to belong to a tractable class. Formally, let C be either TW(k) or HW(k), for k ≥ 1. A WDPT (T, λ, x̄) is locally in C, if for each node t ∈ T such that λ(t) = {R1 (v̄1 ), . . . , Rm (v̄m )} the CQ Ans() ← R1 (v̄1 ), . . . , Rm (v̄m ) is in C. We write `-C for the set of all WDPTs that are locally in C. It is known that local tractability leads to tractability of evaluation for projection- free WDPTs [10]. On the other hand, this result does not hold in the presence of pro- jection, even when C is of bounded treewidth. Formally, E VAL(`-TW(k)) and E VAL(`- HW(k)) are NP-complete for every k ≥ 1 [10]. This raises the question of which further restrictions on WDPTs are needed to achieve tractability. Here we identify a natural such restriction, called bounded inter- face. Intuitively, this restricts the number of variables shared between a node in a WDPT and its children. Formally, a WDPT (T, λ, x̄) has c-bounded interface, for c ≥ 1, if for each node t ∈ T with children t1 , . . . , tk it is the case that the number of variables that appear both in a relational atom in λ(t) and in a relational atom in λ(ti ), for some 1 ≤ i ≤ k, is at most c. We denote by BI(c) the set of WDPTs of c-bounded interface. Interestingly, similar restrictions on the number of variables shared by different atoms of CQs have been recently applied for obtaining reasonable bounds for the problem of containment of Datalog into unions of CQs [5]. One of our main results states that local tractability and bounded interface yield tractability of WDPT evaluation: Theorem 1. Let C be TW(k) or HW(k) and c ≥ 1. Then E VAL(`-C ∩ BI(c)) is in P TIME. Notice that CQs are a special case of WDPTs consisting of the root node only. Hence, TW(k) ⊆ `-TW(k) ∩ BI(c) and HW(k) ⊆ `-HW(k) ∩ BI(c) hold for each c ≥ 1. Therefore, Theorem 1 tells us that `-TW(k)∩BI(c) and `-HW(k)∩BI(c) define relevant extensions of TW(k) and HW(k), respectively, that preserve tractability of evaluation. 3.2 Partial evaluation of WDPTs: Given the nature of WDPTs, it is also interesting to check whether a mapping h is a partial answer to the WDPT p over D [11, 1], i.e., whether h can be extended to some answer h0 to p over D. This gives rise to the partial evaluation problem PARTIAL - E VAL(C) for a class C of WDPTs defined as follows: Given a database D and a WDPT p ∈ C over σ, as well as a partial mapping h : X → U, where X is the set of variables mentioned in p, does there exists some h0 ∈ p(D) such that h0 extends h? If projection is allowed, then partial evaluation is NP-complete even under local tractability, i.e., even for the classes `-TW(k) and `-HW(k), for each k ≥ 1 [10]. It is easy to modify the proof of Theorem 1 to show that adding bounded inter- face to local tractability yields efficient partial evaluation; that is, PARTIAL -E VAL(`- TW(k) ∩ BI(c)) and PARTIAL -E VAL(`-HW(k) ∩ BI(c)) are in P TIME. However, par- tial evaluation is seemingly easier than exact evaluation. Hence, the question naturally arises if tractability of partial evaluation of WDPTs can be ensured by a weaker condi- tion. Indeed, we give a positive answer to this question below. This condition will be re- ferred to as global tractability. Intuitively, it states that there is a bound on the treewidth (resp., hypertreewidth) of the CQs defined by the different subtrees of a WDPT (T, λ, x̄) rooted in r. Formally, let C be TW(k) or HW(k), for k ≥ 1. A WDPT (T, λ, x̄) is glob- ally in C, if for each subtree T 0 of T rooted in r it is the case that the CQ qT 0 is in C. We denote by g-C the set of all WDPTs that are globally in C. The following proposition formally states that global tractability is a strictly weaker condition than the conjunction of local tractability and bounded interface. Proposition 1. 1. Let k, c ≥ 1. Then `-TW(k) ∩ BI(c) ⊆ g-TW(k + 2c) and `- HW(k) ∩ BI(c) ⊆ g-HW(k + 2c). 2. For every k ≥ 1 there is a family Ck of WDPTs in g-TW(k) (resp., in g-HW(k)) such that Ck 6⊆ BI(c), for each c ≥ 1. We can now formally state that global tractability leads to tractability of the partial evaluation problem for WDPTs: Theorem 2. PARTIAL -E VAL(g-TW(k)) and PARTIAL -E VAL(g-HW(k)) are in P TIME for every k ≥ 1. It remains to answer the question if global tractability also suffices to ensure tractability of (exact) evaluation for WDPTs. It turns out that this is not the case. Proposition 2. E VAL(g-TW(k)) and E VAL(g-HW(k)) are NP-complete for all k ≥ 1. 3.3 Semantics based on maximal mappings: The semantics of projection-free WDPTs is only based on maximal mappings, i.e., map- pings that are not subsumed by any other mapping in the answer. This is no longer the case in the presence of projection [10]. Recent work on query answering for SPARQL under entailment regimes has established the need for a semantics for WDPTs that is uniquely based on maximal mappings [1]. This semantics is formalized as follows. As- sume D is a database and p is a WDPT over σ. The evaluation of p over D under maximal mappings, denoted pm (D), corresponds to the restriction of p(D) to those mappings h ∈ p(D) that are not extended by any other mapping h0 ∈ p(D). This natu- rally leads to the decision problem M AX -E VAL(C) defined as follows: Given a database D and a WDPT p ∈ C over σ, as well as a partial mapping h : X → U, where X is the set of variables mentioned in p, is h ∈ pm (D)? It follows from [1] that M AX -E VAL(C) is clearly intractable for the class C of all WDPTs. Analogously to PARTIAL -E VAL, local tractability is not sufficient to ensure tractability of M AX -E VAL: Proposition 3. For every k ≥ 1 the problems M AX -E VAL(`-TW(k)) and M AX - E VAL(`-HW(k)) are NP-hard. To obtain tractability in this case it is however sufficient to impose global tractability, which is exactly the same condition that yields tractability of partial evaluation for WDPTs (as stated in Theorem 2): Theorem 3. M AX -E VAL(g-TW(k)) and M AX -E VAL(g-HW(k)) are in P TIME for ev- ery k ≥ 1. 4 Further Results Taking these results as a starting point, we were also able to show that in several cases the complexity of static analysis tasks, like deciding containment and equivalence [12], decreases. Next, we also studied the problem of testing if some WDPT is equivalent to one from a tractable class (cf. e.g. [2, 4, 7]), and showed that evaluating such queries is fixed-parameter tractable w.r.t. the size of the query. Finally, we also studied the problem of approximating WDPTs by one from a tractable class (a problem that is now well-understood in the context of CQs [2]). Acknowledgments. The work of Pablo Barceló is funded by the Millenium Nucleus Center for Semantic Web Research under grant NC120004. Part of this work was done while Reinhard Pichler and Sebastian Skritek were visiting Pablo Barceló on invitation by the Millenium Nucleus Center for Semantic Web Research. Reinhard Pichler and Se- bastian Skritek were supported by the Vienna Science and Technology Fund (WWTF) through project ICT12-015 and by the Austrian Science Fund (FWF):P25207-N23. References 1. S. Ahmetaj, W. Fischl, R. Pichler, M. Simkus, and S. Skritek. Towards reconciling sparql and certain answers. In WWW’15, 2015. Accepted for publication. 2. P. Barceló, L. Libkin, and M. Romero. Efficient approximations of conjunctive queries. SIAM J. Comput., 43(3):1085–1130, 2014. 3. P. Barcelo, R. Pichler, and S. Skritek. Efficient evaluation and approximation of well- designed pattern trees. In PODS’15, 2015. Accepted for publication. 4. P. Barceló, M. Romero, and M. Y. Vardi. Semantic acyclicity on graph databases. In PODS’13, pages 237–248, 2013. 5. P. Barceló, M. Romero, and M. Y. Vardi. Does query evaluation tractability help query containment? In PODS’14, pages 188–199, 2014. 6. C. Chekuri and A. Rajaraman. Conjunctive query containment revisited. Theor. Comput. Sci., 239(2):211–229, 2000. 7. V. Dalmau, P. G. Kolaitis, and M. Y. Vardi. Constraint satisfaction, bounded treewidth, and finite-variable logics. In CP’02, pages 310–326, 2002. 8. G. Gottlob, N. Leone, and F. Scarcello. The complexity of acyclic conjunctive queries. J. ACM, 48(3):431–498, 2001. 9. G. Gottlob, N. Leone, and F. Scarcello. Hypertree decompositions and tractable queries. J. Comput. Syst. Sci., 64(3):579–627, 2002. 10. A. Letelier, J. Pérez, R. Pichler, and S. Skritek. Static analysis and optimization of semantic web queries. ACM Trans. Database Syst., 38(4):25, 2013. 11. J. Pérez, M. Arenas, and C. Gutierrez. Semantics and complexity of SPARQL. ACM Trans. Database Syst., 34(3), 2009. 12. R. Pichler and S. Skritek. Containment and equivalence of well-designed SPARQL. In PODS’14, pages 39–50, 2014. 13. M. Yannakakis. Algorithms for acyclic database schemes. In VLDB’81, pages 82–94, 1981.