Introduction

Query Answering on Expressive Datalog Ontologies

Mostafa Milani

Andrea Cal

andrea@dcs.bbk.ac.uk 0 1

Leopoldo Bertossi

bertossig@scs.carleton.ca 2 0 Dept of Computer Science Birkbeck, Univ. of London , UK 1 Oxford-Man Institute of Quantitative Finance University of Oxford , UK 2 School of Computer Science Carleton University , Canada

Introduction

The Datalog family of ontology languages [ 1 ], which extends Datalog with existential quanti cation, has been gaining importance in the area of Ontology Querying due to its capability of capturing several prominent Semantic Web languages as well as o ering e cient query answering services in many variants relevant for applications. The core feature of Datalog languages are so-called tuple-generating dependencies (TGDs), which are the main form of rules. Such rules allow the inference of new atoms from an initial set (a database), which is captured by the notion of chase procedure. For example, consider the TGD r(X; Y ) ! 9Z s(X; Z) and a database D constituted by a single atom r(a; b). A chase step will generate the atom s(a; ), where is a labelled null, that is, a placeholder for an unknown value; notice that the constant b is lost in this step as it doesn't appear in the new atom. Conjunctive query answering under general TGDs is undecidable; languages of the Datalog family impose therefore restrictions on the form of rules so as to guarantee decidability and good computational properties. Guarded (extended by weakly-guarded ) Datalog were the rst form of decidable Datalog languages, inspired by guarded logic and characterised by the presence of a guard atom in each rule, that contains all variables of that rule.

The language called sticky Datalog [ 1 ] was introduced in order to capture a \proper" notion of join in rules, that is, the occurrence of variables in two distinct atoms of a rule in the absence of a guard for that rule. Weakly-sticky Datalog is an extension of sticky Datalog that extends sticky Datalog by also capturing the well-known class of weakly-acyclic TGDs.

Let be the following set of rules, where some body variables are marked (denoted by a hat sign, e.g. X^ ; see [ 1 ]) as the result of a procedure that identi es positions (arguments of predicates; for instance, p[ 1 ] is the position corresponding to the rst argument of the predicate p) where, in the chase procedure, values can appear that can eventually be lost in a subsequent chase step. v(X) ! 9Y r(X; Y ) p(X^ ; Y^ ) ! 9Z p(Y; Z) r(X; Y^ ); r(Y^ ; Z) ! p(X; Z): p(X^ ; Y ); p(Y; Z) ! t(Y; Z) A set of rules is sticky if there is no marked variable in a rule that appears more than once in the body of the rule. Intuitively, stickiness can be de ned by means of the following property: during a chase step according to a rule , each value corresponding to a variable appearing more than once in is not lost in the chase step, and it is also never lost in any subsequent step involving atoms where it appears. is not sticky, as easily seen.

Weak-acyclicity is de ned using the notion of rank of a position [ 2 ]. A position has nite (resp., in nite) rank if the number of labelled nulls that can appear in that position in the chase procedure is nite (in nite). In the Datalog program above, F = fv[ 1 ]; r[ 1 ]; r[ 2 ]g is the set of positions with nite rank and 1 = fp[ 1 ]; p[ 2 ]g is the set of positions with in nite rank. A Datalog program is weakly-acyclic if all positions of predicates have nite rank. The program is not weakly-acyclic.

A set of rules is weakly-sticky if every marked variable that is repeated in the body of a rule appears at least once in a position with nite rank. This generalizes stickiness with acyclicity because the stickiness is only applied on values that appear in the positions with in nite rank. Here, is weakly-sticky. Speci cally in the second rule, the repeated variable Y appears in the positions r[ 1 ] and r[ 2 ] in F . In the last rule, Y is a repeated variable but not marked.

So far, no non-trivial deterministic algorithm for CQ answering has been devised for weakly-sticky Datalog . In this paper we set the basis for the implementation of conjunctive query (CQ) answering by the following contributions. 1. We propose a bottom-up technique, where we ground the rules of according to the database D in a suitable way. The expansion instance terminates at a point dependent on the size of the query; the query can be then evaluated directly on the expanded instance. 2. We propose a hybrid approach to CQ answering as follows. First, we transform all positions in F into positions of rank 0 (that is, positions where only constants can appear); then certain variables in such positions are grounded, that is, the variables are replaced by selected constants of the initial instance. The obtained program is a sticky program, for which a well-known query rewriting technique for CQ answering can be applied. 2

A Grounding Approach

A rst deterministic algorithm for query answering on a weakly-sticky Datalog program and a database is obtained by grounding the rules of in a bottomup fashion. The procedure is inspired by the version of the chase procedure in [ 5, 6 ]. We illustrate this technique by means of an example. Consider a database D = fp(a; b); v(b)g, a Boolean conjunctive query (BCQ) q = fu(X)g, and a set of WS rules as follows: 1 : p(X; Y ) ! 9Z p(Y; Z) and 2 : v(X); p(X; Y ); p(Y; Z) ! u(Y ).

We start from D and iteratively generate ground rules by mapping via homomorphism the body of the rules in into D or the head of the rules in 0 (the current set of ground rules). The basic procedure is as follows: we iteratively add a ground rule to 0 if its head atom is not homomorphic to the head of a rule already in 0, until no new rule can be added. This \cautious" procedure, similar to the chase procedure in [ 5, 6 ], guarantees its termination. In our example, the algorithm stops after adding only one ground rule p(a; b) ! p(b; 1) to 0, where 1 is a labelled null.

In order to complete our grounding, we resume the above basic procedure Mq times, where Mq is the number of existential variables in q. Before each resumption, we freeze the labelled nulls, which after having been frozen are considered as constants. In our example, we have only one more resumption, which adds the rules, p(b; 1) ! p( 1; 2) and v(b); p(b; 1); p( 1; 2) ! u( 1). Resumptions are needed, intuitively, to capture applications of rules (in a chase procedure) where a join variable, appearing in two or more distinct atoms, are mapped to a labelled null.

Now, the number of resumptions depends on the query, which makes our grounding dependent on the query; however, for practical purposes, we could ground with N resumptions so as to be able to answer queries with up to N existential variables, and if a query with more than N existential variables is to be answered, we can incrementally retake the already-computed grounding and add the required number of resumptions. 3

A Hybrid Approach

In this section, we propose an algorithm based on a hybrid approach that combines the grounding of speci c variables in rules with the query rewriting algorithm from [ 3 ]. Given a set of WS rules and a database D, our algorithm transforms according to D into a sticky set of rules 0. Sticky Datalog enjoys rst-order rewritability, that is, each CQ can be rewritten into a rst-order query and directly answered on the database [ 3 ], providing the correct answer.

As an example, consider D = fv(a); s(a; b)g and a set of WS rules that includes 1 and 2 from the example in Section 2 as well as the following rules: 3 : v(X) ! 9Y r(X; Y ): 4 : r(X; Y ); s(X; Z) ! t(Y; Z): 5 : r(X; Y ); t(Y; Z) ! p(X; Z):

Let us call weak rules the rules in which some repeated marked body variables appear at least once in a position with nite rank; we call such variables weak variables. We aim at grounding the weak variables only, thus turning the WS program into a sticky program.

In our example, 4 and 5 are weak rules with X and Y as their weak variables respectively. 4 is partially grounded as a sticky rule r(a; Y ); s(a; Z) ! r(Y; Z) since X can only take the value a from D. To (partially) ground 5 we would need Y to be replaced with a null value invented by 3. Notice that the variables we are to ground can only be in positions with nite rank, which guarantees the niteness of the partial grounding. Our algorithm consists of two phases.

(Phase 1). We remove the existential variables that are in the positions with nite rank with a method inspired by [ 4 ]. In the context of our example, we rst Skolemize 3, that is the only rule with an existential variable in a position with nite rank. The result is v(X) ! r(X; f (X)). Next, we replace r(X; f (X)) with an expanded predicate r0(X; f; X) with the higher arity that results into a rule v(X) ! r0(X; f; X) in which f is a fresh constant that represents the replaced function. Then, we iteratively propagate the expanded predicate in the body and head of the rules, possibly expanding other predicates e.g. t, obtaining the new rules: 30 : v(X) ! r0(X; f; X), 40 : r0(X; Y; Y 0); s(X; Z) ! t0(Y; Y 0; Z), and 50 : r0(X; Y; Y 0); t0(Y; Y 0; Z) ! p(X; Z). The result is a set of rules 0;1 = f 1; 2; 30; 40; 50g with positions of rank either zero or in nite.

(Phase 2). Now, we proceed with the grounding step where we replace the weak variables (X in 40 and Y; Y 0 in 50) with values from D and the new constant f. The result is a set of sticky rules 0 that includes 1, 2 and 30 as well as the following rules, 400 : r0(a; Y; Y 0); s(a; Z) ! t0(Y; Y 0; Z), and 500 : r0(X; f; a); t0(f; a; Z) ! p(X; Z).

To compute the answers to the query q, 0 and q are rewritten into a rstorder query using the rewriting method for sticky rules [ 3 ]. The rst-order query is then evaluated on D. 4

Conclusion

Weakly-sticky Datalog is an expressive ontology language with good computational properties and capable of capturing the most prominent Semantic Web languages. We proposed two deterministic algorithms for answering conjunctive queries on weakly-sticky Datalog . This, we believe, sets the basis for practical query answering algorithms for real-world scenarios. We plan to continue our work by running experiments on large data sets. We also intend to re ne the hybrid algorithm by devising solutions to reduce the number of constants of the database used to ground the weak variables; this would improve the e ciency of our approach.

Cali , G.

Gottlob, and

Pieris . Towards more expressive ontology languages: The query answering problem . Arti cial Intelligence , 2012 , 193 : 87 - 128 .

Fagin ,

P. G.

Kolaitis ,

R. J.

Miller , and

Popa . Data Exchange: Semantics and Query Answering . TCS, 2005 , 336 : 89 - 124 .

Gottlob , G.

Orsi, and

Pieris . Query Rewriting and Optimization for Ontological Databases . Proc. TODS , 2014 , 39 ( 3 ): 25 .

4. Krotzsch, M. and Rudolph , S. Extending Decidable Existential Rules by Joining Acyclicity and Guardedness . Proc. IJCAI , 2011 , pp. 963 - 968 .

Leone ,

Manna , G. Terracina, and

Veltri. E ciently Computable Datalog Programs . Proc. KR , 2012 , pp. 13 - 23 .

Milani ,

Bertossi . Tractable Query Answering and Optimization for Extensions of Weakly-Sticky Datalog+- . Proc. AMW , 2015 .