Space Bounds in Ontological Reasoning via Extended Bell Numbers (DISCUSSION PAPER) Cinzia Marte[0000−0003−3920−8186] University of Calabria, Rende, Italy marte@mat.unical.it Abstract. In ontology-based query answering a user query is typically evaluated over instances containing both known and anonymous indi- viduals. In this context, the algebraic notion of isomorphism is relevant in many application scenarios, such as the so-called parsimonious chase. Two atoms are isomorphic if there is a bijection among them that, in addition, is the identity on the known individuals. A naive upper bound on the maximum cardinality of any instance containing non-isomorphic atoms is well-known from the literature. However, there are cases in which this bound is far from being optimal. This paper generalizes Bell numbers to provide a tight bound in the above setting. The main re- sult is also relevant in classical databases by characterizing the family of non-equivalent atomic queries. Keywords: ontology-based query answering · existential rules · parsimonious chase · Bell numbers · atomic queries. 1 Introduction Ontology-Based Query Answering (OBQA) consists in querying databases by taking ontological knowledge into account. It is a fascinating research topic deeply studied not only in database theory [1,13], but also in artificial intelli- gence [6,5,11] and in logic [2,3,12]. Moreover, OBQA is strictly related to others important application areas such as data integration [21], data exchange [9], and consistent query answering [23,24]. In particular, OBQA is the problem of an- swering a query q against a logical theory consisting of an extensional database D paired with an ontology Σ. The goal is to find certain answers to q, i.e. the query must be true in every possible model of the theory [20,7]. Here, we fo- cus on ontologies expressed via existential rules, also known as tuple generating dependencies (TGDs) or datalog∃ rules. They are at the core of Datalog± [14], an emerging family of ontology languages, which collects the basic decidable Copyright c 2019 for the individual papers by the papers’ authors. Copying per- mitted for private and academic purposes. This volume is published and copyrighted by its editors. SEBD 2019, June 16-19, 2019, Castiglione della Pescaia, Italy. classes of TGDs, and generalizes several ontology specification languages such as Description Logics (DLs) [10]. Indeed, datalog∃ generalizes the well-known language Datalog [16] with existential quantification in the head. OBQA can be reduced to the problem of answering q over a universal model U that can be homomorphically embedded into every other model of the logical theory. A way to compute a universal model is to employ the so called chase procedure. Starting from D, the chase “repairs” violations of rules by repeatedly adding new atoms –introducing fresh values, called nulls, whenever required by an existential variable– until a fixed point satisfying all rules is reached. There- fore, in the classical setting, the chase is sound and complete. But, unfortunately, the chase does not always terminates [17,18]. Recently, in [22] a new class of datalog∃ ontologies, called Shy, has been singled out for existential rules. It enjoys a new semantic property called parsimony and results in a powerful and decidable class that combines positive aspects of different Datalog± classes [4]. The parsimony property is based on the parsi- monious chase (pchase) procedure that repairs violations of rules only if the (inferred) head atom can not be homomorphically mapped to any atom previ- ously produced. For some classes of Datalog± , the parsimony property is sound and complete with respect to atomic query answering. Moreover, the termina- tion of the pchase is always guaranteed, and computational complexity has been studied [22]. Understanding the nature of the pchase procedure can lead to ob- tain practical improvements of existing implementations [22]. Indeed, so far, the research has been focused on establishing the termination of the pchase proce- dure, thus providing just a very rough upper bound of its maximal size, without any understanding of the relations between the logical theory and the atoms generated by the pchase. We fill this gap deepening this relevant connection. In particular, we need to understand what kind of atoms belong to the pchase. This has as side effect and it is strictly related to count the number of atoms gen- erated by the pchase. An immediate consequence of this better understanding of the chase leads to re-prove computational complexity results, as we improve the upper bounds previously identified in [22]. Indeed, these are very large, as they are aimed at demonstrating computational complexity, and not how many atoms can be produced by the chase. In this paper, we present contents already set out and published in [8]. In par- ticular, we provide an exact upper bound for the pchase. To this end, we exploit the notion of “equality type” defined in [19], which we show to be strictly related to the form of non-isomorphic atoms of a given predicate. Then, by exploiting the notion of Bell numbers, counting the number of distinct partitions of a finite set, we compute an upper bound for the number of atoms generating by the pchase procedure and we show that there exists a family of ontologies for which the pchase can produce exactly the upper bound previously computed, so that it corresponds to the maximal number of atoms effectively generated by the pchase procedure. Finally, as a corollary, our estimation of the maximum cardinality of any instance containing non-isomorphic atoms also provides a tight bound on the maximum number of non-equivalent atomic queries over a given relation, where two queries are considered equivalent if they give the same answers on every database. 2 Preliminaries Throughout this paper we use the following notation. Let ∆ = ∆C ∪ ∆N ∪ ∆V the domain of the terms, consisting of the union of the three countably infinite domains of constants, nulls and variables, respectively. We write ϕ to denote a null; X a variable; a an atom, that is an expression of the form p(t), where p = pred(a) is a predicate, t = t1 , . . . , tk is a tuple of terms, k = arity(a) is the arity of a or p, and a[i] is the i-th term of a. Moreover, const(a) (resp., vars(a)) is the set of constants (resp., variables) occurring in a. The set of predicates is denoted by R. Let T ⊆ ∆ a nonempty subset, then the set of all atoms that can be formed with predicates of R and terms from T is denoted by base(T ). Moreover, any subset of base(∆C ∪ ∆N ) constitutes an instance I, and whenever I ⊆ base(∆C ), then it is also called database. A substitution is a total mapping s : ∆ → ∆. Let χ1 and χ2 be two structures containing atoms. An homomorphism h : χ1 → χ2 is a substitution such that: (1) if c ∈ ∆C , then h(c) = c; (ii) if ϕ ∈ ∆N , then h(ϕ) ∈ ∆C ∪ ∆N ; (iii) h(χ1 ) is a substructure of χ2 . An existential rule r is a logical implication of the form ∀X∀Y(∃Z a(X, Z) ← φ(X, Y)), where X, Y, and Z denote sets of variables; head(r) = a(X, Z), while body(r) = φ(X, Y) is a conjunction of atoms and can also be empty. We define a datalog∃ program P as a finite set of existential rules, called ontology and denoted by dep(P ) (dependencies of P ), paired with a database instance, denoted by data(P ). Moreover, pred(P ) (resp., const(P )) represents the set of predicates (resp., constants) occurring in (P ) and arity(P ) is the maximum arity over pred(P ). Given an instance I, we say that a rule r is satisfied by I if whenever there is a homomorphism h : body(r) → I, there is a homomorphism h0 ⊃ h|vars(body(r)) s.t. h0 : head(r) → I. An instance I is a model of a program P if each rule of dep(P ) is satisfied by I, and data(P ) ⊆ I. A firing homomorphism for r and I is any homomorphism h : body(r) → I s.t. h = h|vars(body(r)) . The fire of r via h produces the atom f ire(r, h) = σ(h(head(r))), where σ = σ|vars(h(head(r))) (i.e., it replaces each existential variable of r with a different fresh null). Given a firing homomorphism h for a rule r and an instance I, we say that the pair hr, hi satisfies the parsimonious fire condition w.r.t. an instance I 0 ⊇ I if there is no homomorphism from {h(head(r))} to I 0 . Finally, given a datalog∃ program P , the parsimonious chase (pchase) of P (pchase(P )) is constructed as follows. We start from I 0 = data(P ) and create a copy of it in I. Then, for each r in dep(P ), for each unspent firing homomorphism h for the pair hr, Ii we add the f ire(r, h) to I 0 if hr, hi satisfies the parsimonious fire condition w.r.t. I 0 . If I 6= I 0 , we create a new copy of I 0 and repeat the previous steps. Otherwise, we return I. 3 Parsimonious Chase Estimation In this section, we introduce some basic notions that will help us to find a tight upper bound for the pchase. We highlight a main property of the pchase, based on isomorphic atoms, a crucial notion in several Datalog± classes [15]. Theorem 1. Given a program P , pchase(P ) does not contain isomorphic atoms. Proof. Assume, by contradiction, that there are two isomorphic atoms a and a0 in pchase(P ). Thus, there is a homomorphism h from {a} to {a0 } s.t. h−1 is a homomorphism from {a0 } to {a}. W.l.o.g. assume that a ∈ I, for some I generated during the pchase procedure. As a0 ∈ pchase(P ), then there is a rule r, an instance I 0 ⊇ I, and an unspent firing homomorphism h0 for hr, I 0 i, s.t. f ire(r, h0 ) = a0 , against the fact that h−1 ◦ σ is a homomorphism from {h0 (head(r))} to I 0 . Indeed, (h−1 ◦ σ)(h0 (head(r)) = h−1 (σ(h0 (head(r)) = h−1 (f ire(r, h0 )) = h−1 (a0 ) = a ∈ I ⊆ I 0 . t u To provide a precise upper bound for the number of steps execute by the pchase, we introduce the concept of type that is equivalent to the notion of equality type defined in. Definition 1 (Type). Let m be a positive integer, S an arbitrary partition of {1, . . . , m}, C a set with |C| ≤ |S|, and f : C → S an injective map. We define the type of S, C and f as the family of sets T (S, C, f ) = s ∪ f −1 (s) | s ∈ S .   Example 1. Let m = 6, C = {c1 , c2 }, and let S = {1, 2}, {3, 6}, {4}, {5} be a partition of {1, . . . , 6}. Consider the injective  map f : C → S such that f (c1 ) = {3, 6} and f (c2 ) = {5}. Then, T (S, C, f ) = {1, 2}, {3, 6, c1 }, {4}, {5, c2 } . Fixed an integer m, our aim is to count the number of all possible types that can be generated from any partition of the set {1, . . . , m}, by varying C on a superset D of a fixed size d. In order to do this, we resort to the Bell number Bn , that is the number of ways to partition a set of n labeled elements. Theorem 2. Let m ∈ N, D a finite set of size d > 0, and Bn the n-th Bell number. Hence, the number of all possible types generated from Pmall the partitions d of the set {1, . . . , m} and all subsets of D is given by γm = h=0 m h h d Bm−h . Proof Sketch. Recall that, given two sets A and B with |A| = α ≤ |B| = β, the β! number of injective maps from A to B is (β−α)! . Then, fixed a partition S of {1, ..., m} with |S| = s, the number of injective maps from any subset C⊆ D to s! S, with |C| = c ≤ s, is (s−c)! , while the number of subsets of size c is dc . Thus, Pmin{s,d} d s! the number of all possible types for the fixed partition S is c=0 c · (s−c)! . Hence, the number of types generated from all the partitions of the set {1, . . . , m} Pm Pmin{s,d} d s! and all subsets of D is given by s=1 S(m, s)· c=0 c · (s−c)! , where S(m, s) is the Stirling number counting the number of partitions of size s on m elements. d It can be shown that it is equivalent to γm . t u Taking advantage of the notion of type, we can provide a new representation of an arbitrary atom. Definition 2 (Atom Type). Given an atom a = p(t) of arity m, we  define the type of the atom a as Ta = T (S, C, f ), where C = const(a); S = {n | a[n] = ti } | i = 1, . . . , m ; and f : C → S such that f (c) = {n | a[n] = c}. Hence, the type of an atom a has the form Ta = {σ(t1 ), . . . , σ(tm )}, where σ is such that σ(ti ) = {n | n ∈ {1, . . . , m} ∧ a[n] = ti } ∪ {ti } if ti is a constant, and σ(ti ) = {n | n ∈ {1, . . . , m} ∧ a[n] = ti } otherwise. Intuitively, the type of an atom is formed by the sets of positions where a term occurs, by highlighting positions where constants occur.  2. Let a = p1 (ϕ1 , ϕ3 , ϕ2, ϕ1 ) and b = p2 (c, ϕ1 , d, c, ϕ2 , ϕ2 , ϕ1 ). Then, Example Ta = {1, 4}, {2}, {3} and Tb = {1, 4, c}, {2, 7}, {3, d}, {5, 6} . Theorem 3. Let a = p(t1 , . . . , tk ) and a0 = p(t01 , . . . , t0k ) be two atoms. Then, a and a0 are isomorphic if, and only if, pred(a) = pred(a0 ) and Ta = Ta0 . Proof. Let us consider two atoms a and a0 . If pred(a) 6= pred(a0 ) or arity(a) 6= arity(a0 ), then of course can not exists an isomorphism between them. Hence, we can take for granted that the two atoms have same predicate and arity. [⇒] Assume that there is an isomorphism between a and a0 , i.e., there is a homo- morphism h : {a} → {a0 } s.t. h(ti ) = t0i , i = 1, . . . , k and s.t. h−1 : {a0 } → {a} is a homomorphism. Let Ta = {σ(t1 ), . . . , σ(tk )} and Ta0 = {σ 0 (t1 ), . . . , σ 0 (tk )}. We claim that σ(ti ) = σ(t0i ), for i = 1, . . . , k. Assume that σ(ti ) ⊆ σ(t0i ), and let n ∈ σ(ti ) ∩ N, so that a[n] = ti . Therefore, we have that t0i = h(ti ) = h(a[n]) = h(a)[n] = a0 [n]. Hence, n ∈ σ(t0i ). Moreover, if n = c is a constant, by definition of homomorphism, we have c ∈ σ(ti ) ⇒ ti = c ⇒ t0i = h(ti ) = ti = c ⇒ c ∈ σ(t0i ). The reverse inclusion can be easily proved by replacing h by h−1 . [⇐] Let us assume that Ta = Ta0 . Let h : {a} → {a0 } be s.t. h(ti ) = t0i . First, we prove that h is a homomorphism. Let ti = c be a constant. Suppose that c ∈ σ(ti ), then by assumption c ∈ σ(t0i ), hence t0i = c. It remains to be shown that h is also injective. Let t0i = t0j . Then, σ(t0i ) = σ(t0j ) ⇒ σ(ti ) = σ(tj ) ⇒ ti = tj . t u Now, we are able to provide an upper bound for the maximum number of atoms generating by the pchase procedure. Theorem 4. Let P be a program with arity(P ) = w, |const(P )| = Pd, and lm the w d number of predicates in pred(P ) of arity m. Then, |pchase(P )| ≤ m=0 lm γm . Proof. By Theorem 2 and Theorem 3, the total P number of non isomorphic atoms w d over pred(P ) and const(P ) ∪ ∆N is given by m=0 lm γm . Moreover, by The- orem 1, we knowPw that pchase(P ) does not contain isomorphic atoms. Hence, d |pchase(P )| ≤ m=0 lm γm . t u Let Γwd be the upper bound in Theorem 4. To show that it is also tight, we introduce an ordering on types that will allow us to build a program with a sequence of firing homomorphisms generating a pchase of size exactly Γwd . Definition 3 (Type ordering). Let T = T (S, C, f ) and T 0 = T (S 0 , C 0 , f 0 ). Then, T precedes T 0 , if (i) |C| < |C 0 |, or (ii) |C| = |C 0 | and |S| > |S 0 |. Intuitively, such a program should have a rule for each possible atom tag, when- ever constants are allowed in the rules. Otherwise, we need a predicate to collect all constants of the database. To better understand our idea, we give an example of such a program before we provide the formal result. Example 3. Let C be a finite set of constant, and P be a program such that data(P ) = {t(c1 ), t(c2 )}, and dep(P ) is given by ∃X, Y, Zp(X, Y, Z) ∃X, Y p(Z, X, Y ) ← t(Z) ∃Xp(X, Y, Z) ← t(Y ), t(Z) ∃X, Y p(X, X, Y ) ∃X, Y p(X, Z, Y ) ← t(Z) ∃Xp(Y, X, Z) ← t(Y ), t(Z) ∃X, Y p(X, Y, X) ∃X, Yp(X, Y, Z) ← t(Z) ∃Xp(Y, Z, X) ← t(Y ), t(Z) ∃X, Y p(X, Y, Y ) ∃Xp(X, X, Y ) ← t(Y ) p(X, Y, Z) ← t(X), t(Y ), t(Z) ∃Xp(X, X, X) ∃Xp(X, Y, X) ← t(Y ) ∃Xp(Y, X, X) ← t(Y ) ∃Xt(X) We build pchase(P ) by starting from rules in the first column from top to bot- tom. For each rule r in this ordering, we consider all firing homomorphism h for r. E.g., the rule in bold produces the atoms {p(ϕ1 , ϕ2 , c1 ), p(ϕ3 , ϕ4 , c2 )}. Thus, the number of atoms with predicate p generated by the pchase will be 37 = γ32 , and |pchase(P )| = 40 = γ32 + γ12 . Theorem 5. Let w be a positive integer, D a set of constants of size d, and Γwd as above. Then, there is a family Pw of programs s.t. |pchase(Pw )| = Γwd . Proof Sketch. We build a program Pw having two predicates p (of arity w) and t (of arity 1). We set data(Pw ) = {t(c) | c ∈ D}, and define dep(Pw ) as follows. Given a partition Si = {Λ1 , . . . , Λn } of w, where n = |Si |, we construct a rule ri with an empty body, by adding X1 , . . . , Xn existential variables so that Λj = {k | p[k] = Xj , k ∈ [w]}. Now, fixed a rule ri with n > 1 existential variables, we produce n − 1 blocks of rules as follows. We translate j existential variables into universal ones, by adding j atoms over predicate t in the body. Hence, we construct nj rules. Then, we add the rules p(X1 , . . . , Xw ) = t(X1 ), . . . , t(Xw ), and ∃Xt(X). Finally, we remove all rules having in the head more than one repeated universal variable. To prove that |pchase(Pw )| = Γwd , we provide a sequence of Γwd −d firing homomorphisms. To each rule r in dep(Pw ) we associate uniquely an atom g(head(r)), where g maps existential variables to fresh nulls, and universal variables to a fixed constant. The type ordering on the atoms gives an ordering on the rules, and so to the sequence of firing homomorphisms. t u 4 Discussion and Future Work In this work, we identified the maximal number of distinct atoms generable d by the pchase procedure. In particular, γm improves the bound given in [22], that is (d + m) . In particular, d ≤ γm ≤ (d + m)m . Since in the OBQA m m d context, normally, d is much bigger than m, it could seem that the effort to find such a precise upper bound can be useless for practical purposes. However, this is not the case, as shown in the paper. Indeed, the search for a precise upper bound led to identify the fundamental notions of type and type ordering that highlighted some qualitative characteristics of the pchase. Moreover, there could be other contexts where m is much bigger than d (think for example to scenarios where tuples encode strings over a certain alphabet, as in complexity proofs based on Turing Machine simulation). In this cases, our bound represents a concrete improvement. As future work, we plan to extend the pchase condition to rules with a complex head, and to compute the maximal number of distinct atoms generable in this case. Then, we will try to analyze the orderings of fire homomorphisms in the generation of the pchase to understand if we can identify a sort of best ordering that minimizes the number of atoms produced. Finally, we will try to apply this metodology to give exact estimations of others chase versions. References 1. Alviano, M., Pieris, A.: Default negation for non-guarded existential rules. In: Proc of PODS (2015) 2. Amendola, G., Leone, N., Manna, M.: Finite model reasoning over existential rules. TPLP 17(5-6), 726–743 (2017) 3. Amendola, G., Leone, N., Manna, M.: Querying finite or arbitrary models? no matter! existential rules may rely on both once again (discussion paper). In: SEBD. CEUR Workshop Proceedings, vol. 2037, p. 218. CEUR-WS.org (2017) 4. Amendola, G., Leone, N., Manna, M.: Finite controllability of conjunctive query answering with existential rules: Two steps forward. In: IJCAI. pp. 5189–5193. ijcai.org (2018) 5. Amendola, G., Leone, N., Manna, M., Veltri, P.: Reasoning on anonymity in datalog+/-. In: ICLP (Technical Communications). OASICS, vol. 58, pp. 3:1–3:5. Schloss Dagstuhl - Leibniz-Zentrum fuer Informatik (2017) 6. Amendola, G., Leone, N., Manna, M., Veltri, P.: Enhancing existential rules by closed-world variables. In: IJCAI. pp. 1676–1682. ijcai.org (2018) 7. Amendola, G., Libkin, L.: Explainable certain answers. In: IJCAI. pp. 1683–1690. ijcai.org (2018) 8. Amendola, G., Marte, C.: Extending bell numbers for parsimonious chase esti- mation. In: JELIA. Lecture Notes in Computer Science, vol. 11468, pp. 490–497. Springer (2019) 9. Arenas, M., Barceló, P., Libkin, L., Murlak, F.: Foundations of Data Exchange. Cambridge University Press (2014) 10. Baader, F., Calvanese, D., McGuinness, D.L., Nardi, D., Patel-Schneider, P.F. (eds.): The Description Logic Handbook. Cambridge University Press (2003) 11. Baget, J., Leclère, M., Mugnier, M., Salvat, E.: On rules with existential variables: Walking the decidability line. Artif. Intell. 175(9-10), 1620–1654 (2011) 12. Bárány, V., Gottlob, G., Martin Otto: Querying the guarded fragment. LMCS 10(2) (2014) 13. Bourhis, P., Manna, M., Morak, M., Pieris, A.: Guarded-based disjunctive tuple- generating dependencies. ACM TODS 41(4) (2016) 14. Calı̀, A., Gottlob, G., Lukasiewicz, T.: Datalog± : a unified approach to ontologies and integrity constraints. In: Proc. of ICDT (2009) 15. Calı̀, A., Gottlob, G., Lukasiewicz, T.: A general datalog-based framework for tractable query answering over ontologies. In: PODS. pp. 77–86. ACM (2009) 16. Ceri, S., Gottlob, G., Tanca, L.: What you always wanted to know about datalog (and never dared to ask). IEEE Trans. Knowl. Data Eng. 1(1), 146–166 (1989) 17. Deutsch, A., Nash, A., Remmel, J.B.: The chase revisited. In: Proc. of PODS (2008) 18. Fagin, R., Kolaitis, P.G., Miller, R.J., Popa, L.: Data exchange: semantics and query answering. TCS 336(1), 89–124 (2005) 19. Gottlob, G., Orsi, G., Pieris, A.: Ontological queries: Rewriting and optimization. In: ICDE. pp. 2–13. IEEE Computer Society (2011) 20. Imielinski, T., Lipski, W.: Incomplete information in relational databases. J. ACM 31(4), 761–791 (1984) 21. Lenzerini, M.: Data integration: A theoretical perspective. In: PODS. pp. 233–246. ACM (2002) 22. Leone, N., Manna, M., Terracina, G., Veltri, P.: Efficiently computable Datalog∃ programs. In: Proc. of KR (2012) 23. Manna, M., Ricca, F., Terracina, G.: Consistent query answering via ASP from different perspectives: Theory and practice. TPLP 13(2), 227–252 (2013) 24. Manna, M., Ricca, F., Terracina, G.: Taming primary key violations to query large inconsistent data via ASP. TPLP 15(4-5), 696–710 (2015)