Answering Expressive Path Queries over Lightweight DL Knowledge Bases

Answering Expressive Path Queries over Lightweight DL Knowledge Bases MeghynBienvenu LRI -CNRS Université MagdalenaOrtiz Institute of Information Systems Vienna University of Technology MantasŠimkus Institute of Information Systems Vienna University of Technology ParisSud Answering Expressive Path Queries over Lightweight DL Knowledge Bases 59B2AC5B41F14FDFCF20A0C62CF144CE GROBID - A machine learning software for extracting information from scholarly documents

We establish tight complexity bounds for answering an extension of conjunctive 2-way regular path queries over EL and DL-Lite knowledge bases.

Introduction

It has been extensively argued in the description logic (DL) community that answering queries over ABoxes in the presence of ontological constraints formulated in a DL TBox is a fundamental reasoning service. In databases, similar attention has been paid to the related problem of querying graph databases, which are relational databases where only unary and binary predicates occur, or in other words, node-and edge-labeled graphs [8,3]. The relevance of both problems lies in the fact that in many application areas data can be naturally modeled as an ABox or a graph database. This applies, in particular, to XML data on the web, including RDF datasets. While the two communities share some common research goals, the research agendas they have pursued differ significantly. In the DL community, the focus has been on designing efficient algorithms for answering (plain) conjunctive queries in the presence of expressive ontological constraints. By contrast, work on graph databases typically does not consider ontological knowledge, but instead aims at supporting expressive query languages, like regular path queries (RPQs) and their extensions, which enable sophisticated navigation of paths.

This paper aims to help bridge this gap, by considering an expressive extension of RPQs, and providing algorithms and precise complexity bounds for the EL and DL-Lite families of lightweight DLs. We build on conjunctive (2-way) regular path queries (C2RPQs), which simultaneously extend plain conjunctive queries (CQs) and basic RPQs: they allow conjunctions of atoms that can share variables in arbitrary ways, where the atoms may contain regular expressions that navigate the arcs of the database (roles) in both directions. C2RPQs are one of the most expressive and popular languages for querying graph databases. These queries have already been studied for some DLs. In particular, automata-based algorithms have been proposed for the very expressive DLs ZIQ, ZIO, and ZOQ [5,6], for which query answering is 2-EXPTIME hard. Even in data complexity, that is, when the query and ontology are assumed fixed, these algorithms need exponential time. More recently, algorithms for answering C2RPQs were proposed in [11] for Horn-SHOIQ and Horn-SROIQ. They are polynomial in data complexity but still EXPTIME in combined, which is worst-case optimal for these This work partially supported by the Austrian Science Fund (FWF) grants P20840 and T515.

logics. For prominent lightweight DLs like the DL-Lite [4] and EL [2], which underly the OWL 2 profiles, queries with regular paths had not been explored and many questions remained open, like whether algorithms that require only polynomial space are possible. In DL-Lite, FO-rewritability and AC 0 data complexity are clearly lost, since we can express reachability, but it was not known whether P-hardness was avoidable. In this paper, we answer these questions by providing precise complexity bounds.

We propose an extension of C2RPQs that we call conjunctive (2-way) regular path queries with complex labels, abbreviated -C2RPQs. To illustrate its expressiveness, we consider a graph representation of the Mathematics Genealogy Project (MGP) database, which contains about 160K historic records of advisor relationships of PhD holders in mathematics and related disciplines. We use nodes for mathematicians, theses, topics of research, and universities. Nodes and edges are labeled with concepts (unary relations) and roles (binary relations), respectively. Figure 1 An ontology containing the axioms in Fig. 2 can be used to express, for example, that a person that works in computer science or that wrote a doctoral thesis in computer science is a computer scientist (1,2). In a similar way we can define other specialities such as biologists (3,4), logicians (5,6), physicists, etc. We group the first level subjects of the Mathematics Subject Classification (MSC) used in the MGP database into their 5 major areas (7)(8)(9)(10)(11), and these major areas into subjects (12-16).

( In RPQs, C2RPQs and similar languages, regular paths usually talk about the arc labels only, and do not allow to verify conditions on the node labels. For example, one can use the expression hasAdvisor * •WroteThesis • hasTopic to navigate arbitrarily long chains of advisors and visit their thesis topics, but we cannot look at the subjects and thesis topics and impose conditions on them. This limitation is not so significant for graph databases, as arc labels play a more prominent role and are often used to simulate node labels. In the DL setting, by contrast, concepts are crucial and should be treated as first-class citizens. For this reason, we add to C2RPQs the ability to talk about combinations of concept and roles that appear along a path. In our language, we can use the expression (hasAdvisor , Logician ∨CompScientist) * •WroteThesis • (hasTopic, Geometry) to navigate a chain of advisors that are computer scientist or logicians, until we reach one that wrote a doctoral thesis in Geometry. Note that this query could be expressed (less succinctly) using the test operator (cf. [6]): (hasAdvisor • Logician? ∪ hasAdvisor • CompScientist?) * • WroteThesis • hasTopic • Geometry?. However, our language is slightly more expressive (for DLs without role conjunction), since we can navigate a chain of people that are both advisors and coauthors using (hasAdvisor ∧ coAuthor ) * .

In this paper, we show that answering these queries over EL and DL-Lite knowledge bases is PSPACE-complete, but drops to NP if we consider DL-Lite RDFS . For data complexity, the problem is NLSPACE-complete for DL-Lite and P-complete for EL.

Preliminaries

We briefly recall the syntax of DL-Lite R [4] and ELH [2] (and relevant sublogics). As usual, we assume sets N C , N R , and N I of concept names, role names, and individuals. We will use As usual, the semantics is based upon interpretations, which take the form I = (∆ I , • I ), where ∆ I is a non-empty set and • I maps each a ∈ N I to a I ∈ ∆ I , each A ∈ N C to A I ⊆ ∆ I , and each r ∈ N R to r I ⊆ ∆ I × ∆ I . The function • I is straightforwardly extended to general concepts and roles, e.g. (¬A) I = ∆ I \ A I and (∃r.C)

N R to refer to N R ∪{r − | r ∈ N R }, and if R ∈ N R , we use R − to mean r − if R = r and r if R = r − . An ABox isI = {c | ∃d : (c, d) ∈ r I , d ∈ C I }. I satisfies G H if G I ⊆ H I ; it satisfies A(a) (resp. r(a, b)) if a I ∈ A I (resp. (a I , b I ) ∈ r I ). I is a model of K = (T , A) if I satisfies

all inclusions in T and assertions in A.

To simplify the presentation, we will assume that ELH TBoxes are normalized, meaning that all concept inclusions are of one of the following forms:

A A B A ∃r.B B 1 B 2 A ∃r.B A with A, B, B 1 , B 2 concept names.

It is well-known that for every ELH TBox T , one can construct in polynomial time a normalized ELH TBox T that uses fresh concept names such that T |= T and every model of T can be expanded to a model of T .

For convenience, we introduce a set of basic concepts, denoted BC, defined as follows:

BC = N C ∪ {∃R | R ∈ N R } for DL-Lite R , and BC = N C for ELH.

Canonical Models We recall the definition of canonical models for DL-Lite R and ELH KBs. For both logics, the domain of the canonical model I T ,A for a KB (T , A) will consist of paths of the form aR 1 C 1 . . . R n C n (n ≥ 0), where a ∈ Ind(A), each C i is a basic concept, and each R i a (possibly inverse) role. When T is a DL-Lite R TBox, the domain ∆ I T ,A contains exactly those paths aR 1 ∃R − 1 . . . R n ∃R − n which satisfy:

-if n ≥ 1, then T , A |= ∃R 1 (a); -for 1 ≤ i < n, T |= ∃R − i ∃R i+1 and R − i = R i+1 .

When T is a (normalized) ELH TBox, the domain ∆ I T ,A contains exactly those paths ar 1 A 1 . . . r n A n for which each r i ∈ N R , and:

if n ≥ 1, then T , A |= ∃r 1 .A 1 (a); -for 1 ≤ i < n, T |= A i ∃r i+1 .A i+1 .

We denote the last concept in a path p by tail(p), and define I T ,A by taking:

a I T ,A = a for all a ∈ Ind(A) A I T ,A = {a ∈ Ind(A) | T , A |= A(a)} ∪ {p ∈ ∆ I T ,A \ Ind(A) | T |= tail(p) A} r I T ,A = {(a, b) | r(a, b) ∈ A} ∪ {(p 1 , p 2 ) ∈ ∆ I T ,A × ∆ I T ,A | p 2 = p 1 • S C and T |= S r}∪ {(p 2 , p 1 ) ∈ ∆ I T ,A × ∆ I T ,A | p 2 = p 1 • S C and T |= S r − } Note that I T ,A

is composed of a core consisting of the ABox individuals and an anonymous part consisting of (possibly infinite) trees rooted at ABox individuals. We will use I T ,A | e to denote the submodel of I T ,A obtained by restricting the universe to paths having e as a prefix.

Regular Languages

We assume the reader is familiar with regular languages, represented either by regular expressions or nondeterministic finite state automata (NFAs). An NFA over an alphabet Σ is a tuple α = S, Σ, δ, s 0 , F , where S is a finite set of states, δ ⊆ S × Σ × S the transition relation, s 0 ∈ S the initial state, and F ⊂ S the set of final states. We use L(α) to denote the language defined by an NFA α, and when the way a regular language is represented is not relevant, we denote it simply by L.

Conjunctive Regular Path Queries with Complex Labels

We now formally introduce our query language. Definition 1. By B(S) we denote the set of all (positive) Boolean formulas built from the symbols in S ∪ {true, false} using the connectives ∧ and ∨. A conjunctive (twoway) regular path query with complex labels (abbreviated to -C2RPQ) has the form q(x) = ∃y ϕ where x and y are tuples of variables, and ϕ is a conjunction of atoms of the following forms:

(i) β(t), where β ∈ B(N C ) and t ∈ N I ∪ x ∪ y, and (ii) L(t, t ), where L is (an NFA or regular expression defining) a regular language over B(N R ) × B(N C ), and t, t ∈ N I ∪ x ∪ y.

As usual, variables and individuals are called terms, and the variables in x are called answer variables. A query with no answer variables is called Boolean. If all atoms of type (i) are of the form A(t) with A ∈ N C , and the regular languages L in atoms of type (ii) comprise only symbols of the form (R, true) with R ∈ N R , then q is called a conjunctive (two-way) regular path query (C2RPQ). Conjunctive one-way regular path queries with complex labels ( -CRPQs) and conjunctive one-way regular path queries (CRPQs) are defined analogously, the only difference being that they use formulas from B(N R ) instead of B(N R ) in atoms of type (ii). Conjunctive queries (CQs) are C2RPQs where all atoms of type (ii) have the form (R, true)(t, t ) with R ∈ N R .

Example 1. For readability, we write symbols (R, true) ∈ B(N R ) × B(N C ) simply as R. Recall the MGP database. The query q 1 (x, y) in Fig. 3 searches for pairs of computer scientists that have a biologist as common academic ancestor, and such that there is a physicist on that path. It returns, among others, Jack Minker and Edmund Clarke, who have the biologist and mathematician Johann Bernoulli (1667-1748) as common ancestor on a path including the physicist and mathematician Joseph Fourier (1768 -1830); as well as Gert Smolka and Georg Gottlob, who have as ancestors Nikolaus Poda von Neuhaus, an Austrian entomologist (1723 -1798), and the physicist Ludwig Boltzmann (1844 -1906). The query q 2 (x, y) searches for a common academic ancestor x of Robert Kowalski and Franz Baader, together with the country y where the ancestor's thesis was defended; it requires all ancestors on the path to be computer scientists and logicians. It returns one tuple: (Bernard Meltzer, UK). The query q 3 (x) is similar but we require x to be a biologist or physicist, and additionally allow physicists along the path. This query retrieves 8 people, going back to Gabriel Gruber (1740-1805), a Jesuit priest, philosopher, mathematician and professor of physics.

q1(x, y) = CompScientist ∨ Logician(x), CompScientist ∨ Logician(y), ˆhasAdvisor * • (hasAdvisor , Physicist) • hasAdvisor * • (hasAdvisor , Biologist) • (hasAdvisor − ) * • (hasAdvisor − , Physicist) • (hasAdvisor − ) * ˜(x, y) q2(x, y) = [ `hasAdvisor , CompScientist ∧ Logician ´ * ](RKowalski, x), [ `hasAdvisor , CompScientist ∧ Logician ´ * ](FBaader , x) [wroteThesis • submittedTo • locatedIn](x, y) q3(x) = [(hasAdvisor , ((CompScientist ∧ Logician) ∨ Physicist)) * • (hasAdvisor , Biologist ∨ Physicist)](RKowalski, x) [ `hasAdvisor , ((CompScientist ∧ Logician) ∨ Physicist) ´ * •(hasAdvisor )](FBaader , x)

Fig. 3: Example queries

We now define the semantics of -C2RPQs. We say that a set X ⊂ X satisfies a formula ϕ ∈ B(X), written X |= ϕ, if the formula that results from replacing each v ∈ X by true if v ∈ X and by false otherwise is equivalent to true. For a regular language L over the alphabet B(N R ) × B(N C ), we call d 2 an L-successor of d 1 in I if there is some w = (Γ 1 , Υ 1 ) . . . (Γ n , Υ n ) ∈ L and some sequence e 0 , . . . , e n of elements in ∆ I such that e 0 = d 1 , e n = d 2 , and, for all 1 ≤ i ≤ n:

{R ∈ N R | e i−1 , e i ∈ R I } |= Γ i and {A ∈ N C | e i ∈ A I } |= Υ i .

A match for a Boolean -C2RPQ q in an interpretation I is a mapping π from the terms in q to elements in ∆ I such that:

π(c) = c I if c ∈ N I , -{A ∈ N C | π(t) ∈ A I } |= β for each atom β(t)

in q, and π(t ) is an L-successor of π(t) for each atom L(t, t ) in q. We write I |= q if there is a match for q in I, and T , A |= q if I |= q for every model I of T , A.

Given an -C2RPQ q with answer variables v 1 , . . . , v k , we say that a tuple of individuals (a 1 , . . . , a k ) is a certain answer for q w.r.t. T , A just in the case that in every model I of T , A there is a match π for q such that π(v i ) = a I i for every 1 ≤ i ≤ k. Just as for CQs and C2RPQs, deciding whether a tuple of individuals is a certain answer for an -C2RPQ can be linearly reduced to Boolean -C2RPQ entailment. For this reason, we consider only the latter problem in what follows.

It is well known that the canonical model I T ,A can be homomorphically embedded into any model of T , A, hence a CQ q is entailed by T , A if and only if there is a match for q in I T ,A . This result can be easily lifted from CQs to -C2RPQs, as -C2RPQs are also monotonic and their matches are preserved under homomorphisms.

Lemma 1. For every DL-Lite R or ELH KB (T , A) and Boolean -C2RPQ q: T , A |= q if and only if I T ,A |= q. This property will be a crucial element in establishing our main theorem: Theorem 1. Boolean -C2RPQ entailment is NLSPACE-complete in data complexity and PSPACE-complete in combined complexity for DL-Lite R ; the combined complexity drops to NP-complete for DL-Lite RDFS . For ELH, the problem is P-complete in data complexity and PSPACE-complete in combined complexity. All lower bounds hold also for CRPQs and in the absence of role inclusions.

We split the proof of this theorem into parts, with the lower bounds shown in the next section, and the (more involved) proofs of the upper bounds outlined in Section 5.

Lower Bounds

We start by establishing the required lower bounds.

Proposition 1. Boolean CRPQ entailment is 1. NLSPACE-hard in data complexity for DL-Lite RDFS ; 2. P-hard in data complexity for EL; 3. NP-hard in combined complexity for DL-Lite RDFS ; 4. PSPACE-hard in combined complexity for DL-Lite and EL.

Proof. Statement (1) follows from the analogous result for graph databases [8]. It can be shown by a simple reduction from the NLSPACE-complete directed reachability problem: y is reachable from x in a directed graph G if and only if (x, y) is an answer to r * (x, y) w.r.t. the ABox A G encoding G. Statement (2) is immediate given the P-hardness in data complexity of CQ entailment in EL [7], and (3) follows from the well-known NP-hardness in combined complexity of CQ entailment for databases [1].

For statement (4), we give a reduction from the problem of emptiness of the intersection of an arbitrary number of regular languages, which is known to PSPACEcomplete [10]. Consider some regular languages L 1 , . . . , L n over alphabet Σ. We will use the symbols in Σ as role names, and we add a concept name A. Then we set A = {A(a)} and q = ∃x L 1 (a, x) ∧ . . . ∧ L n (a, x). For DL-Lite, we will use the following TBox: T = {A ∃r | r ∈ Σ} ∪ {∃r − ∃s | r, s ∈ Σ}. For EL, we can use T = {A ∃r.A | r ∈ Σ}. Notice that in both cases the canonical model I T ,A consists of an infinite tree rooted at a such that every element in the interpretation has a unique r-child for each r ∈ Σ (and no other children). Thus, we can associate to every domain element the word over Σ given by the unique path from a, and moreover, for every word w ∈ Σ * we can find an element e w whose path from a is exactly w. This means that if w ∈ L 1 ∩ . . . ∩ L n , we obtain a match for q in the canonical model by mapping x to e w . Conversely, if q is entailed, then any match in the canonical model defines a word which belongs to every L i , which means L 1 ∩. . .∩L n is non-empty.

Upper Bounds

The main objective of this section will be to define procedure for deciding I T ,A |= q for a given KB T , A and a given -C2RPQ q. The procedure comprises two main steps. First, we rewrite q into a set Q of -C2RPQs such that I T ,A |= q if and only if I T ,A |= q for some q ∈ Q. The advantage of the rewritten queries is that in order to decide whether I T ,A |= q , we will only need to consider matches which map the variables to Ind(A). The second step evaluates the rewritten queries over the core part of the canonical model involving only Ind(A).

Preliminary Notions

In order to more easily manipulate regular languages, it will prove convenient to use NFAs rather than regular expressions. Thus, in what follows, we assume all binary atoms take the form α(t, t ), where α is an NFA over B(N R )×B(N C ). Given α = S, Σ, δ, s 0 , F , we use α s,G to denote the NFA S, Σ, δ, s, G , i.e. the NFA with the same states and transitions as α but with initial state s and final states G.

A key to defining our rewriting procedure will be to understand how an atom L(t, t ) can be satisfied in the anonymous part of the canonical model I T ,A . A subtlety arises from the fact that the path witnessing the satisfaction of an atom L(t, t ) may be quite complicated: it may move both up and down, passing by the same element multiple times, and possibly descending below t . This will lead us to decompose an atom L(t, t ) into multiple "smaller" atoms corresponding to segments of the L-path which are situated wholly above or below an element. Importantly, we know that the canonical model displays a high degree of regularity, since whenever two elements p 1 and p 2 in the anonymous part end with the same concept (i.e. Tail(p 1 ) = Tail(p 2 )), the submodels I T ,A | p1 and I T ,A | p2 are isomorphic. In particular, this means that if Tail(p 1 ) = Tail(p 2 ), then p 1 is an L-successor of itself in the interpretation I A,T | p1 just in the case that p 2 is an L-successor of itself in the interpretation I A,T | p2 .

We now wish to define a way of testing for a given TBox T and NFA α with states s, s whether Tail(e) = C ensures that there is a loop from e back to itself, situated wholly within I T ,A | e , which takes α from state s to state s . To this end, we construct a table Loop α which contains for each pair s, s of states in α, a subset of BC. If T is a DL-Lite R TBox, then Loop α is defined inductively using the following rules:

(1) for every s ∈ S,

Loop α [s, s] = BC (2) if C ∈ Loop α [s 1 , s 2 ] and C ∈ Loop α [s 2 , s 3 ], then C ∈ Loop α [s 1 , s 3 ] (3) if C ∈ BC, T |= C ∃R, ∃R − ∈ Loop α [s 2 , s 3 ], (s 1 , (Γ, Υ ), s 2 ) ∈ δ, (s 3 , (Γ , Υ ), s 4 ) ∈ δ, {U ∈ N R | T |= R U } |= Γ , {A ∈ N C | T |= ∃R − A} |= Υ , {U ∈ N R | T |= R − U } |= Γ , and {A ∈ N C | T |= C A} |= Υ , then C ∈ Loop α [s 1 , s 4 ]

For ELH, we replace the third rule by:

(3') if C ∈ BC, T |= C ∃r.D, D ∈ Loop α [s 2 , s 3 ], (s 1 , (Γ, Υ ), s 2 ) ∈ δ, (s 3 , (Γ , Υ ), s 4 ) ∈ δ, {s ∈ N R | T |= r s} |= Γ , {A ∈ N C | T |= D A} |= Υ , {s − ∈ N R | T |= r s} |= Γ , and {A ∈ N C | T |= C A} |= Υ , then C ∈ Loop α [s 1 , s 4 ]

Note that the table Loop α can be constructed in polynomial time in |T | and |α| since entailment of inclusions is polynomial for both DL-Lite R and ELH. The following lemma shows that Loop α has the desired meaning:

Lemma 2. For every element p ∈ ∆ I A,T \ Ind(A): Tail(p) ∈ Loop α [s, s ] if and only if p is an L(α s,s )-successor of itself in the interpretation I A,T | p .

Query Rewriting Our aim is to rewrite our query in such a way that we do not need to map any variables to the anonymous part of the model. We draw our inspiration from a query rewriting procedure for Horn-SHIQ described in [9]. The main intuition is as follows. Suppose we have a match π for q which maps some variable y to the anonymous part, and no other variable is mapped below π(y). Then we modify q so that it has essentially the same match except that variables mapped to π(y) are now mapped to the (unique) parent of π(y) in I T ,A . The delicate point is that we must "split" atoms of the form α(t, t ) with y ∈ {t, t } into the parts which are satisfied in the subtree I T ,A | π(y) (these have already been shown to hold, so can be dropped), and those which occur above π(y), whose satisfaction still needs to be determined and thus must be incorporated into the new query. With each iteration of the rewriting procedure, we obtain a query which has a match which maps variables "closer" to the core of I T ,A , until eventually we find some query that has a match which maps all terms to Ind(A).

We now give a recursive non-deterministic query rewriting procedure which implements the above intuition. PROCEDURE rewrite(q, T ) 1. Choose either to output q or to continue. 2. Choose a non-empty set Leaf ⊆ vars(q) and y ∈ Leaf. Rename all variables in Leaf to y. 3. Choose some concept C ∈ BC such that {B | T |= C B} |= ϕ for every atom ϕ(y). Drop all such atoms from q. 4. For each atom α(t, t ) where α = S, Σ, δ, s, F is a NFA and y ∈ {t, t }, choose a sequence s 1 , . . . s n of distinct states from S such that s n ∈ F , replace the atom α(t, t ) in q by the atoms α s,s1 (t, y), α s1,s2 (y, y), . . . , α sn−2,sn−1 (y, y), α sn−1,sn (y, t ).

Replace

each atom α(y, x) with x = y by atoms α s ,s f (y, x) and Υ (y) each atom α(x, y) with x = y by α s,s (x, y) each atom α(y, y) by atoms α s ,s (y, y) and Υ (y) with s, s , s , s f , Υ as in Step 6. 8. If D ∈ N C is the concept chosen in Step 6, add D(y) to q. If D = ∃P − , add α P (z, y) to q, where z is a fresh variable and L(α P ) = {(P, true)}. Go to Step 1.

Slightly abusing notation, we will use rewrite(q, T ) to denote the set of queries which are output by some execution of rewrite on input q,T . We remark that the number of variables and atoms in each query in rewrite(q, T ) is linearly bounded by the original q. This is the key property used to show the following: Lemma 3. There are only exponentially many queries in rewrite(q, T ) (up to equivalence), and each one has size polynomial in |q|.

The next lemma shows that using rewrite(q, T ), we can reduce the problem of finding an arbitrary query match to finding a match involving only ABox individuals. Lemma 4. T , A |= q if and only if there exists a match π for some query q ∈ rewrite(q, T ) in I A,T such that π(t) ∈ Ind(A) for every term t in q .

Query Evaluation We have reduced T , A |= q to checking whether there is a match π for some q ∈ rewrite(q, T ) with π(t) ∈ Ind(A) for every term t in q . However, even when all terms are mapped to ABox individuals, the paths between them may need to pass by the anonymous part in order to satisfy the regular expressions in the query. To handle this problem, we define a relaxed notion of query entailment, which exploits the fact that if all variables are mapped to Ind(A), only loops (that is, paths from an individual a to itself in I A,T | a ) may participate in the paths between them. Hence, we look for paths in the ABox that may use such loops to skip states in the query automata.

Thus, as part of our evaluation procedure, we will need to decide for a given individual a whether a is an L(α s,s )-successor of itself in I A,T | a . We cannot use the table Loop α directly, since it does not take into account the concepts which are entailed due to ABox assertions. We note however that the set of loops starting from a given individual is fully determined by the set of concepts in BC which the individual satisfies. We thus introduce a new table ALoop Definition 2. We write T , A |≈ q if there is a mapping π from the terms in q to Ind(A) such that:

α such that ALoop α [s, s ] contains all subsets G ⊆ BC such that a is an L(α s,s )-successor of itself in I A,T | a whenever G = {C ∈ BC | a ∈ C I T ,A }.(a) π(c) = c for each c ∈ N I , (b) {A ∈ N C | T , A |= A(π(t))} |= β

for each atom β(t) in q, and (c) for each α(t, t ) ∈ q with α = S, Σ, δ, s, F , there is a sequence (a 0 , s 0 ), . . . (a n , s n ) of distinct pairs from Ind(A) × S such that a 0 = π(t), a n = π(t ), s 0 = s, s n ∈ F , and for every 0 ≤ i < n, one of the following holds:

(i) a i = a i+1 and {C ∈ BC | T , A |= C(a i )} ∈ ALoop α [s i , s i+1 ] (ii) there is some (Γ, Υ ) ∈ Σ with (s i , (Γ, Υ ), s i+1 ), {R ∈ N R | T , A |= R(a i , a i+1 )} |= Γ and {A ∈ N C | T , A |= A(a i+1 )} |= Υ .

Lemma 6. T , A |= q if and only if T , A |≈ q for some q ∈ rewrite(q, T ).

Using the preceding lemma, we can derive our upper bounds: Proof. By Lemmas 4 and 6, we can reduce T , A |= q to deciding whether T , A |≈ q for some q ∈ rewrite(q, T ). For items 1 and 2, if T and q are fixed, then computing rewrite(q, T ) requires only constant time in |A|. To decide whether T , A |≈ q for q ∈ rewrite(q, T ), we guess a mapping π from the terms in q to Ind(A) and verify that it satisfies the conditions in Definition 2. Note that for condition (c), we cannot keep the whole sequence (a 0 , s 0 ), . . . (a n , s n ) in memory at once, so we use a binary counter that counts up to Ind(A) × |S| and store only one pair of nodes (a i , s i ), (a i+1 , s i+1 ) at a time. To verify conditions (b) and (c)(ii) we need checks of the form

Proposition 2. Boolean -C2RPQ{R ∈ N R | T , A |= R(a, b)} |= Γ and {A ∈ N C | T , A |= A(a)} |= Υ .

Each one amounts to a fixed number of instance checks (one for each symbol in Γ or Υ ), hence the data complexity of these checks is the same as for instance checking in the corresponding DL: in AC 0 for DL-Lite R , and in P for ELH. This yields the desired upper bounds: NLSPACE for the former, and NLSPACE P =P for the latter. For statement 4, instead of building the whole set rewrite(q, T ), which can be exponential, we generate a single q ∈ rewrite(q, T ) non-deterministically. More precisely, we take the initial query q and apply a sequence of rewriting steps to obtain some q ∈ rewrite(q, T ). By Lemma 3, every query in rewrite(q, T ) can be generated after at most exponentially many steps, so we can use a polynomial-sized counter to check when we have reached this limit. Since each rewritten query is of polynomial size (Lemma 3), and we keep a single query in memory at a time, the generation of a single query in rewrite(q, T ) requires only polynomial space. Then we can use the same strategy as above to decide in polynomial space whether T , A |≈ q . We thus have a non-deterministic polynomial space procedure for deciding T , A |= q. Using the well-known fact that NPSPACE =PSPACE, we obtain the desired upper bound.

For statement 3, we note that if T is an DL-Lite RDFS TBox, then the query cannot be rewritten, i.e. rewrite(q, T ) = {q}. Thus, it suffices to decide whether T , A |≈ q. We then remark that the procedure described above is in NP, since we guess a (polysize) mapping π and verify in polytime that π satisfies the conditions of Definition 2.

Future Work

In future work, we plan to study what types of restrictions on the TBox and query lead to better combined complexity. We also wish to explore other types of path-based query languages. One interesting extension which has been recently proposed for graph databases [3] is the addition of path variables. In Boolean queries, these prove useful for speaking about equality of paths. For example, one might want to find whether there is a chain of advisors of the same length between both Edmund Clarke and Bernoulli, and Jack Minker and Bernoulli. Things become even more interesting if we allow path variables in the output. By outputting the paths for the preceding query, we could discover that Clarke and Minker have a path to Bernoulli of length 12. We could even output the common areas of expertise of the scientists along the path to find out that both have an 8-step path to some physicist (Poisson and Fourier), that in turn reach Bernoulli via an important figure in analysis (Lagrange) and a graph theorist (Euler).

Fig. 1 :1Fig. 1: Example graph database of the Mathematics Genealogy Project

Fig. 2 :2Fig. 2: An example MGP ontology

a set of assertions of the form A(b) or r(b, c), where A ∈ N C , r ∈ N R , and b, c ∈ N I . A TBox is a set of inclusions, whose form depends on the DL in question. In DL-Lite, inclusions take the form B 1 (¬)B 2 , where each B i is either A (where A ∈ N C ) or ∃R (where R ∈ N R ). DL-Lite R additionally allows role inclusions of the form R 1 (¬)R 2 , where R 1 , R 2 ∈ N R . DL-Lite RDFS is obtained from DL-Lite R by disallowing inclusions which contain negation or have existential concepts (∃R) on the right-hand side. In EL, inclusions have the form C 1 C 2 , where C 1 , C 2 are complex concepts constructed as follows: C := | A | C C | ∃r.C. The DL ELH additionally allows role inclusions of the form r s, where r, s ∈ N R . A knowledge base (KB) K = (T , A) consists of a TBox T and an ABox A.

5 .5Drop all atoms α s,s (y, y) such that C ∈ Loop α [s, s ]. 6. Choose some D ∈ BC and R ∈ N R such that: (a) if T is a DL-Lite R TBox, then C = ∃R − and T |= D ∃R. (a') if T is an ELH TBox, then R ∈ N R and T |= D ∃R.C. (b) for each atom of the form α(y, x) with α = S, Σ, δ, s, F , there exist s ∈ S and (Γ, Υ ) ∈ Σ with (s, (Γ, Υ ), s ) ∈ δ and {U ∈ N R | T |= R − U } |= Γ . (c) for each atom of the form α(x, y) with α = S, Σ, δ, s, F , there exists s ∈ S, s f ∈ F and some (Ω, Θ) ∈ Σ with (s , (Ω, Θ), s f ) ∈ δ such that {U ∈ N R | T |= R U } |= Ω and {A ∈ N C | T |= C A} |= Θ. Note that both (b) and (c) apply to an atom of the form α(y, y).

Lemma 5 .5Note that the size of ALoop α is exponential in |T |, but the associated decision problem is in P: It can be decided in polytime in |T | and |α| whether G ∈ ALoop α [s, s ].

Foundations of Databases SAbiteboul RHull VVianu 1995 Addison-Wesley Pushing the EL envelope FBaader SBrandt CLutz Proc. of IJCAI of IJCAI 2005 Expressive languages for path queries over graph-structured data PBarceló CAHurtado LLibkin PTWood Proc. of PODS of PODS 2010 Tractable reasoning and efficient query answering in description logics: The DL-Lite family DCalvanese GDe Giacomo DLembo MLenzerini RRosati Journal of Automated Reasoning 39 3 2007 Answering regular path queries in expressive description logics: An automata-theoretic approach DCalvanese TEiter MOrtiz Proc. of AAAI of AAAI 2007 Regular path queries in expressive description logics with nominals DCalvanese TEiter MOrtiz Proc. of IJCAI of IJCAI 2009 Data complexity of query answering in description logics DCalvanese GDGiacomo DLembo MLenzerini RRosati Proc. of KR of KR 2006 Graphlog: a visual formalism for real life recursion MPConsens AOMendelzon Proc. of PODS of PODS 1990 Query rewriting for Horn-SHIQ plus rules TEiter MOrtiz MŠimkus TTran GXiao Proc. of AAAI of AAAI 2012 Lower bounds for natural proof systems DKozen Proc. of FOCS of FOCS 1977 Query answering in the horn fragments of the description logics SHOIQ and SROIQ MOrtiz SRudolph MSimkus Proc. of IJCAI of IJCAI 2011