Acyclic Query Answering Under Guarded Disjunctive Existential Rules and Consequences to DLs Pierre Bourhis1 , Michael Morak2 , and Andreas Pieris2 1 CNRS LIFL Université Lille1/INRIA Lille, France. pierre.bourhis@univ-lille1.fr 2 University of Oxford, Department of Computer Science, United Kingdom. {michael.morak,andreas.pieris}@cs.ox.ac.uk Abstract. The complete picture of the complexity of conjunctive query answer- ing under guarded disjunctive existential rules has been recently settled. However, in the case of (unions of) acyclic conjunctive queries ((U)ACQs) there are some fundamental questions which are still open. It is the precise aim of the present paper to close those questions, and to understand whether the acyclicity of the query has a positive impact on the complexity of query answering. Our main re- sult states that acyclic conjunctive query answering under a fixed set of guarded disjunctive existential rules is E XP T IME-hard. This result together with an E XP - T IME upper bound obtained by exploiting classical results on guarded first-order logic, gives us a complete picture of the complexity of our problem. We also show that our results can be used as a generic tool for establishing results on (U)ACQ answering under several central DLs. In fact, restricting the query lan- guage to UACQs improves the complexity to E XP T IME-complete for any DL between DL-Litebool and ALCHI; this holds even for fixed TBoxes. 1 Introduction Rule-based languages have a prominent presence in the areas of artificial intelligence and databases. A noticeable formalism, originally intended for expressing complex queries over relational databases, is Datalog, i.e., function-free first-order Horn logic. As Datalog itself is not able to infer the existence of new objects that do not occur in the extensional database, strong interest in enhancing Datalog with existential quantifi- cation in rule-heads emerged in recent years; see, e.g., [5, 6, 12, 23, 26]. The obtained rules are known under a variety of names such as existential rules, tuple-generating dependencies (TGDs), and Datalog± rules. However, these languages are still not ex- pressive enough for non-deterministic reasoning, e.g., to express simple statements like “a grandchild is a male or a female”, which can be logically expressed using the rules ∀X∀Y (parentOf (X, Y ) ∧ isfather (Y ) → ∃Z(grparentOf (X, Z) ∧ human(Z))) ∀X(human(X) → male(X) ∨ female(X)). Extending TGDs to allow such a disjunction in the head yields the formalism of disjunc- tive TGDs (DTGDs) [17]. Such an extension of plain Datalog (a.k.a. full TGDs), called disjunctive Datalog, has been studied in [18]. However, the addition of existential quan- tifiers easily leads to undecidability of the main reasoning tasks, including conjunctive query (CQ) answering [8]. Several concrete languages which ensure decidability have thus been proposed; see, e.g., [5, 10, 12, 13, 19, 23, 24]. In this paper, we will focus on a concept called “guard- edness” [10], which requires that all universally quantified variables in a rule appear together in an atom in the left-hand side of the implication. Guardedness is a well- accepted paradigm, giving rise to robust languages that capture important description logics such as DL-Lite [14], EL [4] and even ALCI. Recently, the complexity picture for query answering under guarded DTGDs has been investigated and completed for (unions of) arbitrary CQs [9]. Also for simple atomic queries the complexity has been investigated (see, e.g., [1, 21]). As acyclic CQs (ACQs) in the literature usually lead to a reduction in the complexity of the query answering problem, the objective of the present paper is to investigate whether this positive impact can also be observed in our setting. To this end, we aim to pinpoint the exact complexity of answering (unions of) acyclic CQs ((U)ACQs), in order to compare it to existing results for answering (unions of) arbitrary CQs. More precisely, we concentrate on the following fundamental question: What is the exact complexity of answering (U)ACQs under guarded DTGDs, and how is it affected if we consider a signature of bounded arity, or a fixed set of dependencies? After establishing the complexity of (U)ACQ answering, we then investigate the consequences to DLs. We show that our results can be used as a generic tool for es- tablishing results on (U)ACQ answering under several central DLs that can be found in the literature such as the key DL ALCHI, that is, the DL ALC extended with role hierarchies and inverse roles. Our contributions can be summarized as follows: 1. We show that UACQ answering under guarded DTGDs with predicates of bounded arity is in E XP T IME. This upper bound is established by exploiting results from guarded first-order logic [22]; 2. We show that ACQ answering under fixed sets of guarded DTGDs is E XP T IME- hard. This is a strong and general lower bound obtained by simulating the behavior of an alternating linear space Turing machine. 3. Finally, we show that our results for fixed arity and fixed sets extend to a number of expressive DLs, namely ALCHI, ELUHI, DL-LiteH bool and also DL-Litebool . For each of these DLs, and any others between them, answering UACQs, even over fixed TBoxes, is E XP T IME-complete. The upper bounds follow from the fact that guarded DTGDs are powerful enough to capture each of these DLs, and they thus inherit the positive effects of restricting the query language to acyclic queries. The lower bounds follow from the construction of our proof for guarded DTGDs. 2 Preliminaries General. Let C, N and V be pairwise disjoint infinite countable sets of constants, (la- beled) nulls and variables, respectively. We denote by X sequences (or sets) of variables X1 , . . . , Xk . Let [n] = {1, . . . , n}, for n > 1. A term is a constant, null or variable. An atom has the form p(t1 , . . . , tn ), where p is an n-ary predicate, and t1 , . . . , tn are terms. For an atom a, dom(a) and var (a) are the set of its terms and the set of its vari- ables, respectively; those notations extend to sets of atoms. Usually conjunctions and disjunctions of atoms are treated as sets of atoms. An instance I is a (possibly infinite) set of atoms of the form p(t), where t is a tuple of constants and nulls. A database D is a finite instance with only∧constants. Whenever an instance I is treated as a logical formula, is the formula ∃X ( a∈I I), where X contains a variable for each null in I. Conjunctive Queries. A conjunctive query (CQ) q is a sentence ∃Xφ(X), where φ is a conjunction of atoms. If q does not have free variables, then it is called Boolean. For brevity, we consider only Boolean CQs; however, all the results of the paper can be easily extended to non-Boolean CQs. A union of conjunctive queries (UCQ) is a disjunction of a finite number of CQs. By abuse of notation, sometimes we consider a UCQ as set of CQs. A CQ q = ∃Xφ(X) has a positive answer over an instance I, written I |= q, if there exists a homomorphism h such that h(φ(X)) ⊆ I. The answer to a UCQ Q over I is positive, written I |= Q, if there exists q ∈ Q such that I |= q. Several subclasses of conjunctive queries have been considered in the literature, with the aim of reducing the complexity of the CQ evaluation problem. This paper focuses on the class of acyclic CQs (ACQs); see, e.g., [16]. The acyclicity of a CQ is defined via the acyclicity of its hypergraph. Informally, a hypergraph is acyclic if it can be reduced to the empty hypergraph by iteratively eliminating some non-maximal hyperedge, or some vertex contained in at most one hyperedge; this procedure is known as the GYO algorithm. The formal definition is as follows. Consider a hypergraph H. The GYO-reduct of H, denoted GYO(H), is obtained by applying exhaustively the following two steps: (1) eliminate vertices that are contained in at most one hyperedge; and (2) eliminate hyperedges that are empty or contained in other hyperedges. We say that H is acyclic if GYO(H) = ⟨∅, ∅⟩. The hypergraph of a CQ q, denote H(q), is a hypergraph ⟨V, H⟩, where V = dom(q), and, for each atom a in q, there exists a hyperedge h ∈ H such that h = dom(a). We say that q is acyclic iff H(q) is acyclic. Disjunctive Tuple-Generating Dependencies. A disjunctive ∨ntuple-generating de- pendency (DTGD) σ is a first-order formula ∀X (φ(X) → i=1 ∃Yi ψi (X, Yi )), where n > 1, X ∪ Y1 ∪ . . . ∪ Yn ⊂ V, and φ, ψ1 , . . . , ψn are∨conjunctions of atoms. n The formula φ is called the body of σ, denoted body(σ), while i=1 ψi is the head of σ, denoted head (σ). If n = 1, then σ is called tuple-generating dependency (TGD). The schema of a set Σ of DTGDs, denoted sch(Σ), is the set of all predicates occurring in Σ. For brevity, we will omit the universal quantifiers in front of DTGDs, and use the comma (instead of ∧) for conjoining atoms. An instance I satisfies σ, written I |= σ, if the following holds: Whenever there exists a homomorphism h such that h(φ(X)) ⊆ I, then there exists i ∈ [n] and h′ ⊇ h such that h′ (ψi (X, Yi )) ⊆ I; I satisfies a set Σ of DTGDs, denoted I |= Σ, if I |= σ, for each σ ∈ ∧ Σ. Whenever a set Σ of DTGDs is treated as a logical formula, in fact is the formula ( σ∈Σ σ). As said, (U)CQ answering under DTGDs (which is defined below) is undecidable; in fact, is undecidable already for TGDs [8], even when the set of TGDs is fixed and the query is acyclic [11], or even when the set of TGDs is singleton [5]. A well-known prop- erty which guarantees decidability of query answering is guardedness [11]. A DTGD σ is guarded if there exists an atom a ∈ body(σ), called guard, which contains all the variables occurring in body(σ), i.e., var (a) = var (body(σ)). Query Answering. The models of a database D and a set Σ of DTGDs, denoted mods(D, Σ), is the set of instances {I | I ⊇ D and I |= Σ}. The answer to a CQ q w.r.t. D and Σ is positive, denoted D ∪ Σ |= q, if I |= q, for each I ∈ mods(D, Σ). The answer to a UCQ w.r.t. D and Σ is defined analogously. The problem tackled in this work is defined as follows: Given a CQ q, a database D, and a set Σ of DTGDs, decide whether D ∪ Σ |= q. If the input query is an ACQ, then the above problem is called ACQ answering. Analogously, the problem UACQ answering for unions of ACQs is defined. The data complexity of the above problems is calculated taking only the database as input. For the combined complexity, the query and set of DTGDs count as part of the input as well. Disjunctive Chase. We employ ∨n the disjunctive chase [17]. Consider an instance I, and a DTGD σ : φ(X) → i=1 ∃Y ψi (X, Y). We say that σ is applicable to I if there exists a homomorphism h such that h(φ(X)) ⊆ I, and the result of applying σ to I with h is the set {I1 , . . . , In }, where Ii = I ∪ h′ (ψi (X, Y)), for each i ∈ [n], and h′ ⊇ h is such that h′ (Y ) is a “fresh” null not occurring in I, for each Y ∈ Y. For such an application of a DTGD, which defines a single DTGD chase step, we write I⟨σ, h⟩{I1 , . . . , In }. A disjunctive chase tree of a database D and a set Σ of DTGDs is a (possibly infinite) tree such that the root is D, and for every node I, assuming that {I1 , . . . , In } are the children of I, there exists σ ∈ Σ and a homomorphism h such that I⟨σ, h⟩{I1 , . . . , In }. The disjunctive chase algorithm for D and Σ consists of an ex- haustive application of DTGD chase steps in a fair fashion, which leads to a disjunctive chase tree T of D and Σ; we denote by chase(D, Σ) the set {I | I is a leaf of T }. It is well-known that, given a UCQ Q, D ∪ Σ |= Q iff I |= Q, for each I ∈ chase(D, Σ). The Guarded Fragment of First-Order Logic. The guarded fragment (GFO) has been introduced in [2]. The set of GFO formulas over a schema R is the smallest set (1) containing all atomic R-formulas and equalities; (2) closed under the logical connec- tives ¬, ∧, ∨, →; and (3) if a is an R-atom containing all the variables of X ∪ Y, and φ is a GFO formula with free variables contained in (X ∪ Y), then ∀X(a → φ) and ∃X(a∧φ) are GFO formulas. It is known that the problem of deciding whether an GFO sentence is satisfiable is 2E XP T IME-complete, and E XP T IME-complete for predicates of bounded arity [22]. 3 Answering (Unions of) Acyclic Queries The problem of answering (unions of) arbitrary conjunctive queries has recently been investigated in [9]. It turns out that even for fixed sets of very simple forms of DTGDs, the problem is already 2E XP T IME-hard. Restricting the query language to (unions of) acyclic queries has often proven to be a way to reduce the complexity. Our goal in this section is thus to investigate if this also holds true for guarded DTGDs, and to study how exactly the complexity of (U)CQ answering under our guarded DTGDs is affected if we focus on acyclic queries. We will begin this section by summarizing our results, and then proceed with our new complexity bounds. Table 1 summarizes the complexity results for answering (U)ACQs under guarded DTGDs. Compared with the results for (U)CQs, as presented in [9], two observa- tions can be made: Firstly, the combined and the data complexity do not change if we restrict ourselves to acyclic queries. Secondly, however, the complexity decreases from 2E XP T IME-completeness to E XP T IME-completeness, in the case of predicates of bounded arity, and also if we consider a fixed set of DTGDs. The impact of restricting Combined Complexity Bounded Arity Fixed Rules Data Complexity 2E XP T IME E XP T IME E XP T IME coNP UB: [9, Thm. 1] UB: Thm. 1 UB: Thm. 1 UB: [7, Thm. 19] LB: [11, Thm. 6.1] LB: Thm. 2 LB: Thm. 2 LB: [15, Thm. 4.5] Table 1: Complexity of answering (U)ACQs under guarded DTGDs. the query language to acyclic queries is this only noticeable in these two cases. The novel results of this section follow: 1. UACQ answering under guarded DTGDs is in E XP T IME if we focus on predicates of bounded arity (Theorem 1) – this is shown by a reduction to satisfiability of a slight extension of GFO, and then exploit the fact that this problem is in E XP - T IME in case of predicates of bounded arity [22]; and 2. ACQ answering under fixed sets of guarded DTGDs is E XP T IME-hard (Theorem 2) – by simulating the behavior of an alternating linear space Turing machine. Existing results from the literature, provide us with the missing optimal upper and lower bounds, which complete the entire picture of the complexity of (U)ACQ answer- ing. Firstly, the missing upper bounds are immediately inherited from results presented in [9], in particular the combined and data complexity of guarded DTGDs, which are in 2E XP T IME and coNP, respectively. Secondly, the coNP-hardness of ACQ answering under guarded DTGDs follows from [15, Theorem 4.5] – ∧ the CQ employed in the con- struction of the proof of this result is ∃X∃Y1 ∃Y2 ∃Z1 ∃Z2 ( i∈{1,2} (pi (X, Yi ) ∧ s(Yi ) ∧ ri (X, Zi ) ∧ t(Zi ))), which is clearly acyclic. From the results of this section, we observe that the acyclicity of the query does not make our problem easier w.r.t. the combined and data complexity. However, in the cases of bounded arity and fixed sets of DTGDs, the complexity decreases from 2E XP T IME to E XP T IME. Let us now proceed with the formal proofs of our results. 3.1 Complexity Results We start this section by showing that UACQ answering under guarded DTGDs is in E X - P T IME in case of predicates of bounded arity. We can assume that each guarded DTGD ∨ σ is of the form α(X) ∧ ϕ(X) → 16i6m ∃Yψi (X, Y), where ϕ and ψi are conjunc- tions of atoms, and α is the atom that forms the guard of σ. Each ∨ such formula can now in fact be rewritten as the formula ∀X(α(X) → (ϕ(X) → 16i6m ∃Yψi (X, Y))). It can be verified that, if each ψi is a single atom, then the above formula falls into GFO. However, if ψi is a conjunction of atoms, then the formula is “almost” guarded since the existentially quantified variables are not necessarily guarded. Interestingly, the sat- isfiability algorithm proposed in [22] is able to treat formulas as the one above, even if the existentially quantified variables are not guarded, without increasing the complex- ity – this is explicitly stated in a remark on page 8 of [22]. By exploiting the above observation, it is now easy to reduce our problem to the satisfiability problem of sen- tences which are “almost” guarded, which is E XP T IME-complete in case of predicates of bounded arity [22]. Consider a database D, a set Σ of guarded DTGDs, and a UACQ Q. The database D itself contains only ground atoms and is thus trivially guarded. The set Σ can be rewritten as a sentence ΦΣ which is “almost” guarded, as explained above. Finally, the query Q, although is not guarded, it can be rewritten as a GFO sentence ΦQ due to acyclicity [20]. Thus, D∪Σ |= Q iff the “almost” GFO sentence (D∧ΦΣ ∧¬ΦQ ) is not satisfiable, and we obtain the following: Theorem 1. UACQ answering under guarded DTGDs is in E XP T IME in case of pred- icates of bounded arity. We continue this section by showing that ACQ answering under fixed sets of guarded DTGDs is E XP T IME-hard. The construction simulates an alternating L IN S PACE Turing machine. Using a fixed set of rules, chains of non-deterministic length can be generated, where the length represents a number. By linking these chains appropriately, configu- rations of the Turing machine can be guessed. An acyclic query is then used to check consistency w.r.t. the configuration and the transition function. The desired result fol- lows since alternating L IN S PACE already coincides with E XP T IME. Theorem 2. ACQ answering under fixed sets of guarded DTGDs is E XP T IME-hard. Proof. The proof is by a reduction from the non-acceptance of an alternating linear space Turing machine M on input I. Let M = (S, Λ, δ, s0 ), where S = S∀ ⊎ S∃ ⊎ {sa } ⊎ {sr } is a finite set of states partitioned into universal states, existential states, an accepting state and a rejecting state, Λ = {0, 1, ⊔} is the tape alphabet with ⊔ be the blank symbol, δ : S×Λ → (S×Λ×{−1, 0, +1})2 is the transition function, and s0 ∈ S is the initial state. We assume that M is well-behaved and never tries to read beyond its tape boundaries, always halts, and uses exactly n tape cells, where n = O(|I|). Fur- thermore, we assume that a rejecting configuration does not have a subsequent config- uration, while an accepting configuration has only itself as a subsequent configuration. Finally, we assume that s0 ∈ S∃ , and also that every universal configuration is followed by two existential configurations and vice versa. The above assumptions can be made, without sacrificing the generality of our proof, since the non-acceptance problem of M remains E XP T IME-hard. We are going to construct a database D, a set Σ of DTGDs, and a UACQ Q such that D ∪ Σ |= Q iff M rejects I, and Σ is a fixed set of guarded DTGDs. By [9, Lemma 4], the obtained lower bound holds even if we focus on ACQs. The idea of the proof is to construct trees, which encode possible computation trees of M on the input string I, by chasing D with Σ, and then exploit Q to check their consis- tency. On each configuration node v, which represents the configuration Cv of M , we attach a chain of length n, which mimics the tape in Cv , and a chain of length at most |S|, which encodes the state of Cv . The formal definition of D, Σ and Q follows. The Database D. Let D = {conf ∃ (c)}, where c ∈ C is a special constant which represents the initial configuration (which is, by assumption, existential). The Set Σ. The predicates that we are going to use are self-explanatory, and thus we proceed with the construction of Σ without describing the predicates of sch(Σ). – Each existential (universal) configuration has one (two) successor configuration(s) which is (are) universal (existential): conf ∃ (X) → ∃Y succ(X, Y ), conf ∀ (Y ), conf ∀ (X) → ∃Y ∃Z succ 1 (X, Y ), conf ∃ (Y ), succ 2 (X, Z), conf ∃ (Z). – Each configuration has a cell- and state-chain which simulates its tape and encodes its state, respectively: conf x (X) → ∃Y cell (X, Y ), for each x ∈ {∃, ∀}, cell (X, Y ) → ∃Z cell (Y, Z) ∨ end (Y ), conf x (X) → ∃Y state(X, Y ), for each x ∈ {∃, ∀}, state(X, Y ) → ∃Z state(Y, Z) ∨ end (Y ). – Finally, we guess the content of each cell and the cursor position: cell (X, Y ) → 0(Y ) ∨ 1(Y ) ∨ ⊔(Y ), cell (X, Y ) → cursor (Y ) ∨ notCursor (Y ). The construction of Σ is now completed. It is easy to see that Σ does not depend on M , and also that Σ is a set of guarded DTGDs. The UACQ Q. We now proceed with the construction of Q which ensures that each model of D ∪ Σ encodes a valid computation tree. Roughly, Q consists of the following disjuncts: 1. Qinitial for checking that in the initial configuration the tape contains the input string I = a1 . . . ak , the state is s0 , and the cursor is at the first cell; 2. Qlength for checking that cell-chains are of length n, and state-chains are of length at most |S|; 3. Qcursor for checking that exactly one cell is pointed by the cursor; 4. Qtrans for checking the consistency w.r.t. the transition function of M ; and 5. Qinertia for checking that the tape cells not under the cursor keep their old value during a transition. We assume a fixed order on S. Given a state s ∈ S, let #s be its position in this order. The length of the state-chain which encodes s is #s . For a predicate p ∈ {cell , state}, pi (X, Y ) is used as a shorthand for a chain of length i along predicate p, namely ∃Z1 . . . ∃Zi−1 (p(X, Z1 ) ∧ . . . ∧ p(Zi−1 , Y )). A useful subquery that we will use is ( ) length[p]k (X) ≡ ∃Y pk (X, Y ) ∧ end (Y ) stating that there exists a p-chain of length k. Having all the necessary ingredients in place, we are now ready to formalize the components of Q. 1. Qinitial is defined as (recall that c ∈ C represents the initial configuration) ∨ ∨ ( ) ∃X conf ∃ (c) ∧ cell i (c, X) ∧ a(X) ∨ i∈[n] a∈Λ\{ai } ∨ (conf ∃ (c) ∧ length[state]#s (c)) ∨ s∈S\{s0 } ∨ ( ) ∃X conf ∃ (c) ∧ cell i (c, X) ∧ head (X) . 1