Semantic Acyclicity Under Constraints⋆ Pablo Barceló1 , Georg Gottlob2 , and Andreas Pieris3 1 Center for Semantic Web Research & DCC, University of Chile pbarcelo@dcc.uchile.cl 2 Department of Computer Science, University of Oxford georg.gottlob@cs.ox.ac.uk 3 Institute of Information Systems, Vienna University of Technology pieris@dbai.tuwien.ac.at Query optimization is a fundamental database task that amounts to transform a query into one that is arguably more efficient to evaluate. The database theory commu- nity has developed several principled methods for optimization of conjunctive queries (CQs), many of which are based on static-analysis tasks such as containment [1]. In a nutshell, such methods compute a minimal equivalent version of a CQ, where min- imality refers to number of atoms. As argued by Abiteboul, Hull, and Vianu [1], this provides a theoretical notion of “true optimality” for the reformulation of a CQ, as op- posed to practical approaches based on heuristics. For each CQ, the minimal equivalent CQ is its core [16]. Although the static analysis tasks that support CQ minimization are NP-hard [9], this is not a problem for real-life applications, as the input CQ is small. It is known, on the other hand, that semantic information about the data, in the form of integrity constraints, alleviates query optimization by reducing the space of possi- ble reformulations. In the previous analysis, however, constraints play no role, as CQ equivalence is defined over all databases. Adding constraints yields a refined notion of CQ equivalence, which holds over those databases that satisfy a given set of constraints only. But finding a minimal equivalent CQ in this context is notoriously more difficult than before. This is because basic static analysis tasks such as containment become un- decidable when considered in full generality. This motivated a long research program for finding larger “islands of decidability” of such containment problem, based on syn- tactical restrictions on constraints [2, 6–8, 17, 18]. An important shortcoming of the previous approach, however, is that there is no theoretical guarantee that the minimized version of a CQ is in fact easier to evaluate (recall that, in general, CQ evaluation is NP-complete [9]). We know, on the other hand, quite a bit about classes of CQs which can be evaluated efficiently. It is thus a natural problem to ask whether constraints can be used to reformulate a CQ as one in such tractable classes, and if so, what is the cost of computing such reformulation. Following Abiteboul et al. [1], this would provide us with a theoretical guarantee of “true efficiency” for those reformulations. Here we concentrate on one of the oldest and most studied tractability conditions for CQs; namely, acyclicity. It is known that acyclic CQs can be evaluated in linear time [21]. More formally, let us write q ≡Σ q ′ whenever CQs q and q ′ are equivalent over all databases that satisfy Σ. In this work we study the following problem: ⋆ This paper is a short version of [4]. PROBLEM : S EMANTIC ACYCLICITY INPUT : A CQ q and a finite set Σ of constraints. QUESTION : Is there an acyclic CQ q ′ s.t. q ≡Σ q ′ ? We study this problem for the two most important classes of database constraints: 1. Tuple-generating dependencies (tgds), i.e., expressions of the form ∀x̄∀ȳ(ϕ(x̄, ȳ) → ∃z̄ψ(x̄, z̄)), where ϕ and ψ are conjuntions of atoms. Tgds subsume the important class of referential integrity constraints (or inclusion dependencies). 2. Equality-generating dependencies (egds), i.e., expressions of the form ∀x̄(ϕ(x̄) → y = z), where ϕ is a conjunction of atoms and y, z are variables in x̄. Egds subsume keys and functional dependencies (FDs). A useful aspect of tgds and egds is that containment under them can be studied in terms of the chase procedure [19]. Coming back to semantic acyclicity, the main problem we study is, of course, decid- ability. Since basic reasoning with tgds and egds is, in general, undecidable, we cannot expect semantic acyclicity to be decidable for arbitrary such constraints. Thus, we con- centrate on the following question: Decidability: For which classes of tgds and egds is the problem of semantic acyclicity decidable? In such cases, what is the computational cost of the problem? Since semantic acyclicity is defined in terms of CQ equivalence under constraints, and the latter has received a lot of attention, it is relevant also to study the following: Relationship to CQ equivalence: What is the relationship between CQ equivalence un- der constraints and semantic acyclicity under constraints? Is the latter decidable for each class of tgds and egds for which the former is decidable? Finally, we want to understand to what extent semantic acyclicity helps CQ evalua- tion. Although an acyclic reformulation of a CQ can be evaluated efficiently, computing such reformulation might be expensive. Thus, it is relevant to study the following: Evaluation: What is the computational cost of evaluating semantically acyclic CQs under constraints? Semantic acyclicity in the absence of constraints. The semantic acyclicity problem in the absence of dependencies (i.e., checking whether a CQ q is equivalent to an acyclic one over the set of all databases) is by now well-understood. Regarding decidability, it is easy to prove that a CQ q is semantically acyclic iff its core q ′ is acyclic. (Recall that such q ′ is the minimal equivalent CQ to q). It follows that checking semantic acyclic- ity in the absence of constraints is NP-complete (see, e.g., [5]). Regarding evaluation, semantically acyclic CQs can be evaluated efficiently [10, 11, 15]. The relevance of constraints. In the absence of constraints a CQ q is equivalent to an acyclic one iff its core q ′ is acyclic. Thus, the only reason why q is not acyclic in the first place is because it has not been minimized. This tells us that in this context semantic acyclicity is not different from usual minimization. The presence of constraints, on the other hand, yields a more interesting notion of semantic acyclicity. This is because constraints can be applied on CQs to produce acyclic reformulations of them. Example 1. This simple example helps understanding the role of tgds when reformulat- ing CQs as acyclic ones. Consider a database that stores information about customers, records, and musical styles. The relation Interest contains pairs (c, s) such that cus- tomer c has declared interest in style s. The relation Class contains pairs (r, s) such that record r is of style s. Finally, the relation Owns contains a pair (c, r) when customer c owns record r. Consider now a CQ q(x, y) defined as follows: ( ) ∃z Interest(x, z) ∧ Class(y, z) ∧ Owns(x, y) . This query asks for pairs (c, r) such that customer c owns record r and has expressed interest in at least one of the styles with which r is associated. This CQ is a core but it is not acyclic. Thus, from our previous observations it is not equivalent to an acyclic CQ (in the absence of constraints). Assume now that we are told that this database contains compulsive music collectors only. In particular, each customer owns every record that is classified with a style in which he/she has expressed interest. This means that the database satisfies the tgd: τ = Interest(x, z), Class(y, z) → Owns(x, y). We can now easily reformulate q(x, y) as the following acyclic CQ q ′ (x, y): ( ) ∃z Interest(x, z) ∧ Class(y, z) . Notice that q and q ′ are in fact equivalent over every database that satisfies τ . Contributions. We observe that semantic acyclicity under constraints is not only more powerful, but also theoretically more challenging than in the absence of them. We start by studying decidability. Results for tgds: Having a decidable CQ containment problem is a necessary condition for semantic acyclicity to be decidable under tgds.4 Surprisingly enough, it is not a suf- ficient condition. This means that, contrary to what one might expect, there are natural classes of tgds for which CQ containment but not semantic acyclicity is decidable. In particular, this is the case for the well-known class of full tgds (i.e., tgds without ex- istentially quantified variables in the head). In conclusion, we cannot directly export techniques from CQ containment to deal with semantic acyclicity. In view of the previous results, we concentrate on classes of tgds that (a) have a decidable CQ containment problem, and (b) do not contain the class of full tgds. These restrictions are satisfied by several expressive languages considered in the literature. Such languages can be classified into three main families depending on the techniques used for studying their containment problem: (i) guarded tgds [6], which contain inclu- sion and linear dependencies, (ii) non-recursive [12], and (iii) sticky sets of tgds [7]. Instead of studying such languages one by one, we identify two semantic criteria that yield decidability for the semantic acyclicity problem, and then show that each one of the languages satisfies one such criteria. 4 Modulo some mild technical assumptions; see the extended version for more details. – The first criterion is acyclicity-preserving chase. This is satisfied by those tgds for which the application of the chase over an acyclic instance preserves acyclicity. Guarded tgds enjoy this property. From the analysis we conclude that semantic acyclicity under guarded tgds is decidable and has the same complexity as CQ containment: 2E XP T IME-complete, and NP-complete for a fixed schema. – The second criterion is rewritability by unions of CQs (UCQs). Intuitively, a class C of sets of tgds has this property if the CQ containment problem under a set in C can always be reduced to a UCQ containment problem without constraints. Non- recursive and sticky sets of tgds enjoy this property. In the first case the complexity matches that of CQ containment: NE XP T IME-complete, and NP-complete if the schema is fixed. In the second case, we get a NE XP T IME upper bound and an E XP T IME lower bound. For a fixed schema the problem is NP-complete. The NP bounds for tgds in such decidable classes (under a fixed schema) can be seen as positive results: By spending exponential time in the size of the (small) query, we can not only minimize it, but also find an acyclic reformulation if one exists. Results for egds: After showing that the techniques developed for tgds cannot be ap- plied for showing the decidability of semantic acyclicity under egds, we focus on the class of keys over unary and binary predicates and we establish a positive result, namely semantic acyclicity is NP-complete. We prove this by showing that in such context keys have acyclicity-preserving chase. Interestingly, this positive result can be extended to unary functional dependencies (over unconstrained signatures); this result has been es- tablished independently by Figueira [13]. We leave open whether the problem of seman- tic acyclicity under arbitrary egds, or even keys over arbitrary schemas, is decidable. Evaluation: For tgds for which semantic acyclicity is decidable (guarded, non-recursive, sticky), we can use the following algorithm to evaluate a semantically acyclic CQ q over a database D that satisfies the constraints Σ: (1) Convert q into an equivalent acyclic CQ q ′ under Σ, (2) evaluate q ′ on D, and (3) return q(D) = q ′ (D). The running time is O(|D|·f (|q|, |Σ|)), where f is a double-exponential function (since q ′ can be computed in double-exponential time for each one of the classes mentioned above and acyclic CQs can be evaluated in linear time). This constitutes a fixed-parameter tractable algorithm for evaluating q on D. No such algorithm is believed to exist for CQ evaluation [20]; thus, semantically acyclic CQs under these constraints behave better than the general case in terms of evaluation. However, in the absence of constraints one can do better: Evaluating semantically acyclic CQs in such context is feasible in polynomial time. It is natural to ask if this also holds in the presence of constraints. This is the case for guarded tgds and FDs. For the other classes the problem remains to be investigated. Finite vs. infinite databases. The results mentioned above interpret the notion of CQ equivalence (and, thus, semantic acyclicity) over arbitrary (finite or infinite) databases. The reason is the wide application of the chase we make in our proofs, which charac- terizes CQ equivalence under arbitrary databases only. This is not a serious problem though, as all the particular classes of tgds for which we prove decidability in the paper (i.e., guarded, non-recursive, sticky) are finitely controllable [3, 14]. This means that CQ equivalence under arbitrary databases and under finite databases coincide. Thus, the results we obtain for such classes can be directly exported to the finite case. Acknowledgements. Barceló would like to thank D. Figueira, M. Romero, S. Rudolph, and N. Schweikardt for insightful discussions about the nature of semantic acyclicity under constraints. Barceló is funded by the Millenium Nucleus Center for Semantic Web Research under grant NC120004. Gottlob is supported by the EPSRC Programme Grant EP/M025268/ “VADA: Value Added Data Systems – Principles and Architec- ture”. Pieris is supported by the Austrian Science Fund (FWF), projects P25207-N23 and Y698, and Vienna Science and Technology Fund (WWTF), project ICT12-015. References 1. Abiteboul, S., Hull, R., Vianu, V.: Foundations of Databases. Addison-Wesley (1995) 2. Baget, J.F., Mugnier, M.L., Rudolph, S., Thomazo, M.: Walking the complexity lines for generalized guarded existential rules. In: IJCAI. pp. 712–717 (2011) 3. Bárány, V., Gottlob, G., Otto, M.: Querying the guarded fragment. Logical Methods in Com- puter Science 10(2) (2014) 4. Barceló, P., Gottlob, G., Pieris, A.: Semantic acyclicity under constraints. In: PODS (2016), to appear 5. Barceló, P., Romero, M., Vardi, M.Y.: Semantic acyclicity on graph databases. In: PODS. pp. 237–248 (2013) 6. Calı̀, A., Gottlob, G., Kifer, M.: Taming the infinite chase: Query answering under expressive relational constraints. J. Artif. Intell. Res. 48, 115–174 (2013) 7. Calı̀, A., Gottlob, G., Pieris, A.: Towards more expressive ontology languages: The query answering problem. Artif. Intell. 193, 87–128 (2012) 8. Calvanese, D., De Giacomo, G., Lenzerini, M.: Conjunctive query containment and answer- ing under description logic constraints. ACM Trans. Comput. Log. 9(3) (2008) 9. Chandra, A.K., Merlin, P.M.: Optimal implementation of conjunctive queries in relational data bases. In: STOC. pp. 77–90 (1977) 10. Chen, H., Dalmau, V.: Beyond hypertree width: Decomposition methods without decompo- sitions. In: CP. pp. 167–181 (2005) 11. Dalmau, V., Kolaitis, P.G., Vardi, M.Y.: Constraint satisfaction, bounded treewidth, and finite-variable logics. In: CP. pp. 310–326 (2002) 12. Fagin, R., Kolaitis, P.G., Miller, R.J., Popa, L.: Data exchange: Semantics and query answer- ing. Theor. Comput. Sci. 336(1), 89–124 (2005) 13. Figueira, D.: Semantically acyclic conjunctive queries under functional dependencies. In: LICS (2016), to appear 14. Gogacz, T., Marcinkowski, J.: Converging to the chase - A tool for finite controllability. In: LICS. pp. 540–549 (2013) 15. Gottlob, G., Greco, G., Marnette, B.: Hyperconsistency width for constraint satisfaction: Al- gorithms and complexity results. In: Graph Theory, Computational Intelligence and Thought. pp. 87–99 (2009) 16. Hell, P., Nešetřil, J.: Graphs and Homomorphisms. Oxford University Press (2004) 17. Johnson, D.S., Klug, A.C.: Testing containment of conjunctive queries under functional and inclusion dependencies. J. Comput. Syst. Sci. 28(1), 167–189 (1984) 18. Krötzsch, M., Rudolph, S.: Extending decidable existential rules by joining acyclicity and guardedness. In: IJCAI. pp. 963–968 (2011) 19. Maier, D., Mendelzon, A.O., Sagiv, Y.: Testing implications of data dependencies. ACM Trans. Database Syst. 4(4), 455–469 (1979) 20. Papadimitriou, C.H., Yannakakis, M.: On the complexity of database queries. J. Comput. Syst. Sci. 58(3), 407–427 (1999) 21. Yannakakis, M.: Algorithms for acyclic database schemes. In: VLDB. pp. 82–94 (1981)