=Paper= {{Paper |id=None |storemode=property |title=Answering Expressive Path Queries over Lightweight DL Knowledge Bases |pdfUrl=https://ceur-ws.org/Vol-846/paper_28.pdf |volume=Vol-846 |dblpUrl=https://dblp.org/rec/conf/dlog/BienvenuOS12 }} ==Answering Expressive Path Queries over Lightweight DL Knowledge Bases== https://ceur-ws.org/Vol-846/paper_28.pdf
                Answering Expressive Path Queries
              over Lightweight DL Knowledge Bases ?

               Meghyn Bienvenu1 , Magdalena Ortiz2 , and Mantas Šimkus2
                              1
                                   LRI - CNRS & Université Paris Sud
              2
                  Institute of Information Systems, Vienna University of Technology



        Abstract. We establish tight complexity bounds for answering an extension of
        conjunctive 2-way regular path queries over EL and DL-Lite knowledge bases.


1     Introduction

It has been extensively argued in the description logic (DL) community that answering
queries over ABoxes in the presence of ontological constraints formulated in a DL TBox
is a fundamental reasoning service. In databases, similar attention has been paid to the
related problem of querying graph databases, which are relational databases where only
unary and binary predicates occur, or in other words, node- and edge-labeled graphs [8,
3]. The relevance of both problems lies in the fact that in many application areas data
can be naturally modeled as an ABox or a graph database. This applies, in particular, to
XML data on the web, including RDF datasets. While the two communities share some
common research goals, the research agendas they have pursued differ significantly. In
the DL community, the focus has been on designing efficient algorithms for answering
(plain) conjunctive queries in the presence of expressive ontological constraints. By
contrast, work on graph databases typically does not consider ontological knowledge,
but instead aims at supporting expressive query languages, like regular path queries
(RPQs) and their extensions, which enable sophisticated navigation of paths.
    This paper aims to help bridge this gap, by considering an expressive extension
of RPQs, and providing algorithms and precise complexity bounds for the EL and
DL-Lite families of lightweight DLs. We build on conjunctive (2-way) regular path
queries (C2RPQs), which simultaneously extend plain conjunctive queries (CQs) and
basic RPQs: they allow conjunctions of atoms that can share variables in arbitrary ways,
where the atoms may contain regular expressions that navigate the arcs of the database
(roles) in both directions. C2RPQs are one of the most expressive and popular languages
for querying graph databases. These queries have already been studied for some DLs.
In particular, automata-based algorithms have been proposed for the very expressive
DLs ZIQ, ZIO, and ZOQ [5, 6], for which query answering is 2-E XP T IME hard.
Even in data complexity, that is, when the query and ontology are assumed fixed, these
algorithms need exponential time. More recently, algorithms for answering C2RPQs
were proposed in [11] for Horn-SHOIQ and Horn-SROIQ. They are polynomial in
data complexity but still E XP T IME in combined, which is worst-case optimal for these
?
    This work partially supported by the Austrian Science Fund (FWF) grants P20840 and T515.
logics. For prominent lightweight DLs like the DL-Lite [4] and EL [2], which underly
the OWL 2 profiles, queries with regular paths had not been explored and many ques-
tions remained open, like whether algorithms that require only polynomial space are
possible. In DL-Lite, FO-rewritability and AC0 data complexity are clearly lost, since
we can express reachability, but it was not known whether P-hardness was avoidable.
In this paper, we answer these questions by providing precise complexity bounds.
    We propose an extension of C2RPQs that we call conjunctive (2-way) regular path
queries with complex labels, abbreviated `-C2RPQs. To illustrate its expressiveness, we
consider a graph representation of the Mathematics Genealogy Project (MGP) database,
which contains about 160K historic records of advisor relationships of PhD holders in
mathematics and related disciplines. We use nodes for mathematicians, theses, topics of
research, and universities. Nodes and edges are labeled with concepts (unary relations)
and roles (binary relations), respectively. Figure 1 depicts a fragment of such a graph.

                                                          Unification and
                                                         Matching Problems
                                                   itle
                                             hasT                                                      Country
                            Thesis
                             ThesisJS76        submittedTo    Univ. Essex        locatedIn         UK
                               s is         hasT
                         T  he                   opic                                                                     ComputerScience
                                                          ComputerScience
                  wrote
                              worksIn          AI-AutDed1                                                        AI-LogProg1
      Jörg
   Siekmann         worksI
                              n           ComputerScience                                      pic
                                    AI-MAS3                                             hasTo                                          Country
              ha                                           Thesis
                 sA                                                      submittedTo               Univ.             locatedIn   Germany
                    dv                                      ThesisGS89
                      iso                               is                                    Kaiserslautern
                          r                        Thes                               hasTitle
                                              wrote
                                                 worksIn         AI-DLs1
                                    Gert                   MathLogic&Foundations                                  Logic Programming
                                   Smolka
                                                       worksIn                                                   over Polymorphically
                                                                           AI-HyLo1                               Order-Sorted Types
                                                            MathLogic&Foundations

          Fig. 1: Example graph database of the Mathematics Genealogy Project
    An ontology containing the axioms in Fig. 2 can be used to express, for example,
that a person that works in computer science or that wrote a doctoral thesis in computer
science is a computer scientist (1,2). In a similar way we can define other specialities
such as biologists (3,4), logicians (5,6), physicists, etc. We group the first level subjects
of the Mathematics Subject Classification (MSC) used in the MGP database into their
5 major areas (7–11), and these major areas into subjects (12–16).
   (1)                               ∃worksOn.CompSci v CompScientist
   (2)                ∃wroteThesis.(∃hasTopic.CompSci) v CompScientist
   (3)                ∃worksOn.Biology&NaturalSciences v Biologist
   (4) ∃wroteThesis.(∃hasTopic.Biology&NaturalSciences) v Biologist
   (5)                    ∃worksOn.MathLogic&Foundns v Logician
   (6)     ∃wroteThesis.(∃hasTopic.MathLogic&Foundns) v Logician
  (7)      MathLogic&Foundns v General&Foundns        (12)      General&Foundns v Subject
  (8)                 Geometry v Geometry&Topology (13) DiscreteMath&Algebra v Subject
  (9)                  CompSci v AppliedMath&Other (14)                 Analysis v Subject
  (10) Biology&NaturalSciences v AppliedMath&Other (15) Geometry&Topology v Subject
  (11)                  Physics v AppliedMath&Other (16) AppliedMath&Other v Subject

                                           Fig. 2: An example MGP ontology
    In RPQs, C2RPQs and similar languages, regular paths usually talk about the arc
labels only, and do not allow to verify conditions on the node labels. For example, one
can use the expression hasAdvisor ∗ ◦WroteThesis ◦ hasTopic to navigate arbitrarily long
chains of advisors and visit their thesis topics, but we cannot look at the subjects and
thesis topics and impose conditions on them. This limitation is not so significant for
graph databases, as arc labels play a more prominent role and are often used to simulate
node labels. In the DL setting, by contrast, concepts are crucial and should be treated as
first-class citizens. For this reason, we add to C2RPQs the ability to talk about combina-
tions of concept and roles that appear along a path. In our language, we can use the ex-
pression (hasAdvisor , Logician ∨CompScientist)∗ ◦WroteThesis ◦ (hasTopic, Geometry)
to navigate a chain of advisors that are computer scientist or logicians, until we reach
one that wrote a doctoral thesis in Geometry. Note that this query could be expressed
(less succinctly) using the test operator (cf. [6]): (hasAdvisor ◦ Logician? ∪ hasAdvisor ◦
CompScientist?)∗ ◦ WroteThesis ◦ hasTopic ◦ Geometry?. However, our language is
slightly more expressive (for DLs without role conjunction), since we can navigate a
chain of people that are both advisors and coauthors using (hasAdvisor ∧ coAuthor )∗ .
     In this paper, we show that answering these queries over EL and DL-Lite knowl-
edge bases is PS PACE-complete, but drops to NP if we consider DL-LiteRDFS . For data
complexity, the problem is NLS PACE-complete for DL-Lite and P-complete for EL.


2   Preliminaries

We briefly recall the syntax of DL-LiteR [4] and ELH [2] (and relevant sublogics). As
usual, we assume sets NC , NR , and NI of concept names, role names, and individuals.
We will use NR to refer to NR ∪{r− | r ∈ NR }, and if R ∈ NR , we use R− to mean r− if
R = r and r if R = r− . An ABox is a set of assertions of the form A(b) or r(b, c), where
A ∈ NC , r ∈ NR , and b, c ∈ NI . A TBox is a set of inclusions, whose form depends
on the DL in question. In DL-Lite, inclusions take the form B1 v (¬)B2 , where each
Bi is either A (where A ∈ NC ) or ∃R (where R ∈ NR ). DL-LiteR additionally allows
role inclusions of the form R1 v (¬)R2 , where R1 , R2 ∈ NR . DL-LiteRDFS is obtained
from DL-LiteR by disallowing inclusions which contain negation or have existential
concepts (∃R) on the right-hand side. In EL, inclusions have the form C1 v C2 , where
C1 , C2 are complex concepts constructed as follows: C := > | A | C u C | ∃r.C. The
DL ELH additionally allows role inclusions of the form r v s, where r, s ∈ NR . A
knowledge base (KB) K = (T , A) consists of a TBox T and an ABox A.
    As usual, the semantics is based upon interpretations, which take the form I =
(∆I , ·I ), where ∆I is a non-empty set and ·I maps each a ∈ NI to aI ∈ ∆I , each
A ∈ NC to AI ⊆ ∆I , and each r ∈ NR to rI ⊆ ∆I × ∆I . The function ·I is
straightforwardly extended to general concepts and roles, e.g. (¬A)I = ∆I \ AI and
(∃r.C)I = {c | ∃d : (c, d) ∈ rI , d ∈ C I }. I satisfies G v H if GI ⊆ H I ; it satisfies
A(a) (resp. r(a, b)) if aI ∈ AI (resp. (aI , bI ) ∈ rI ). I is a model of K = (T , A) if I
satisfies all inclusions in T and assertions in A.
    To simplify the presentation, we will assume that ELH TBoxes are normalized,
meaning that all concept inclusions are of one of the following forms:
            >vA          AvB         A v ∃r.B       B 1 u B2 v A         ∃r.B v A
with A, B, B1 , B2 concept names. It is well-known that for every ELH TBox T , one
can construct in polynomial time a normalized ELH TBox T 0 that uses fresh concept
names such that T 0 |= T and every model of T can be expanded to a model of T 0 .
    For convenience, we introduce a set of basic concepts, denoted BC, defined as fol-
lows: BC = NC ∪ {∃R | R ∈ NR } for DL-LiteR , and BC = NC for ELH.
Canonical Models We recall the definition of canonical models for DL-LiteR and ELH
KBs. For both logics, the domain of the canonical model IT ,A for a KB (T , A) will
consist of paths of the form aR1 C1 . . . Rn Cn (n ≥ 0), where a ∈ Ind(A), each Ci is a
basic concept, and each Ri a (possibly inverse) role. When T is a DL-LiteR TBox, the
domain ∆IT ,A contains exactly those paths aR1 ∃R1− . . . Rn ∃Rn− which satisfy:
  – if n ≥ 1, then T , A |= ∃R1 (a);
  – for 1 ≤ i < n, T |= ∃Ri− v ∃Ri+1 and Ri− 6= Ri+1 .
When T is a (normalized) ELH TBox, the domain ∆IT ,A contains exactly those paths
ar1 A1 . . . rn An for which each ri ∈ NR , and:
  – if n ≥ 1, then T , A |= ∃r1 .A1 (a);
  – for 1 ≤ i < n, T |= Ai v ∃ri+1 .Ai+1 .
We denote the last concept in a path p by tail(p), and define IT ,A by taking:
 aIT ,A = a for all a ∈ Ind(A)
AIT ,A = {a ∈ Ind(A) | T , A |= A(a)} ∪ {p ∈ ∆IT ,A \ Ind(A) | T |= tail(p) v A}
 rIT ,A = {(a, b) | r(a, b) ∈ A} ∪
             {(p1 , p2 ) ∈ ∆IT ,A × ∆IT ,A | p2 = p1 · S C and T |= S v r}∪
             {(p2 , p1 ) ∈ ∆IT ,A × ∆IT ,A | p2 = p1 · S C and T |= S v r− }
Note that IT ,A is composed of a core consisting of the ABox individuals and an anon-
ymous part consisting of (possibly infinite) trees rooted at ABox individuals. We will
use IT ,A |e to denote the submodel of IT ,A obtained by restricting the universe to paths
having e as a prefix.

Regular Languages We assume the reader is familiar with regular languages, repre-
sented either by regular expressions or nondeterministic finite state automata (NFAs).
An NFA over an alphabet Σ is a tuple α = hS, Σ, δ, s0 , F i, where S is a finite set of
states, δ ⊆ S × Σ × S the transition relation, s0 ∈ S the initial state, and F ⊂ S the
set of final states. We use L(α) to denote the language defined by an NFA α, and when
the way a regular language is represented is not relevant, we denote it simply by L.


3   Conjunctive Regular Path Queries with Complex Labels
We now formally introduce our query language.
Definition 1. By B(S) we denote the set of all (positive) Boolean formulas built from
the symbols in S ∪ {true, false} using the connectives ∧ and ∨. A conjunctive (two-
way) regular path query with complex labels (abbreviated to `-C2RPQ) has the form
q(x) = ∃y ϕ where x and y are tuples of variables, and ϕ is a conjunction of atoms
of the following forms:
  (i) β(t), where β ∈ B(NC ) and t ∈ NI ∪ x ∪ y, and
 (ii) L(t, t0 ), where L is (an NFA or regular expression defining) a regular language
      over B(NR ) × B(NC ), and t, t0 ∈ NI ∪ x ∪ y.
As usual, variables and individuals are called terms, and the variables in x are called
answer variables. A query with no answer variables is called Boolean.
    If all atoms of type (i) are of the form A(t) with A ∈ NC , and the regular languages
L in atoms of type (ii) comprise only symbols of the form (R, true) with R ∈ NR , then
q is called a conjunctive (two-way) regular path query (C2RPQ). Conjunctive one-way
regular path queries with complex labels (`-CRPQs) and conjunctive one-way regular
path queries (CRPQs) are defined analogously, the only difference being that they use
formulas from B(NR ) instead of B(NR ) in atoms of type (ii). Conjunctive queries (CQs)
are C2RPQs where all atoms of type (ii) have the form (R, true)(t, t0 ) with R ∈ NR .

Example 1. For readability, we write symbols (R, true) ∈ B(NR ) × B(NC ) simply as
R. Recall the MGP database. The query q1 (x, y) in Fig. 3 searches for pairs of com-
puter scientists that have a biologist as common academic ancestor, and such that there
is a physicist on that path. It returns, among others, Jack Minker and Edmund Clarke,
who have the biologist and mathematician Johann Bernoulli (1667–1748) as common
ancestor on a path including the physicist and mathematician Joseph Fourier (1768 –
1830); as well as Gert Smolka and Georg Gottlob, who have as ancestors Nikolaus Poda
von Neuhaus, an Austrian entomologist (1723 - 1798), and the physicist Ludwig Boltz-
mann (1844 – 1906). The query q2 (x, y) searches for a common academic ancestor x
of Robert Kowalski and Franz Baader, together with the country y where the ancestor’s
thesis was defended; it requires all ancestors on the path to be computer scientists and
logicians. It returns one tuple: (Bernard Meltzer, UK). The query q3 (x) is similar but
we require x to be a biologist or physicist, and additionally allow physicists along the
path. This query retrieves 8 people, going back to Gabriel Gruber (1740–1805), a Jesuit
priest, philosopher, mathematician and professor of physics.
 q1 (x, y) = ˆCompScientist ∨ Logician(x), CompScientist ∨ Logician(y),
               hasAdvisor ∗ ◦ (hasAdvisor , Physicist) ◦ hasAdvisor ∗ ◦ (hasAdvisor , Biologist)
                     ◦ (hasAdvisor − )∗ ◦ (hasAdvisor − , Physicist) ◦ (hasAdvisor − )∗ (x, y)
                                                                                        ˜
              `                                         ´∗
 q2 (x, y) = [`hasAdvisor , CompScientist ∧ Logician ´ ](RKowalski, x),
             [ hasAdvisor , CompScientist ∧ Logician ∗ ](FBaader , x)
             [wroteThesis ◦ submittedTo ◦ locatedIn](x, y)
   q3 (x) = [(hasAdvisor , ((CompScientist ∧ Logician) ∨ Physicist))∗
                    ◦ (hasAdvisor , Biologist ∨ Physicist)](RKowalski,
                                                                   ´ x)
            [ hasAdvisor , ((CompScientist ∧ Logician) ∨ Physicist) ∗ ◦(hasAdvisor )](FBaader , x)
             `


                                     Fig. 3: Example queries
    We now define the semantics of `-C2RPQs. We say that a set X ⊂ X satisfies a
formula ϕ ∈ B(X), written X |= ϕ, if the formula that results from replacing each
v ∈ X by true if v ∈ X and by false otherwise is equivalent to true. For a regular
language L over the alphabet B(NR ) × B(NC ), we call d2 an L-successor of d1 in I if
there is some w = (Γ1 , Υ1 ) . . . (Γn , Υn ) ∈ L and some sequence e0 , . . . , en of elements
in ∆I such that e0 = d1 , en = d2 , and, for all 1 ≤ i ≤ n:

        {R ∈ NR | hei−1 , ei i ∈ RI } |= Γi           and     {A ∈ NC | ei ∈ AI } |= Υi .

    A match for a Boolean `-C2RPQ q in an interpretation I is a mapping π from the
terms in q to elements in ∆I such that:
  – π(c) = cI if c ∈ NI ,
  – {A ∈ NC | π(t) ∈ AI } |= β for each atom β(t) in q, and
  – π(t0 ) is an L-successor of π(t) for each atom L(t, t0 ) in q.
We write I |= q if there is a match for q in I, and T , A |= q if I |= q for every model
I of T , A.
    Given an `-C2RPQ q with answer variables v1 , . . . , vk , we say that a tuple of indi-
viduals (a1 , . . . , ak ) is a certain answer for q w.r.t. T , A just in the case that in every
model I of T , A there is a match π for q such that π(vi ) = aIi for every 1 ≤ i ≤ k. Just
as for CQs and C2RPQs, deciding whether a tuple of individuals is a certain answer for
an `-C2RPQ can be linearly reduced to Boolean `-C2RPQ entailment. For this reason,
we consider only the latter problem in what follows.
    It is well known that the canonical model IT ,A can be homomorphically embedded
into any model of T , A, hence a CQ q is entailed by T , A if and only if there is a match
for q in IT ,A . This result can be easily lifted from CQs to `-C2RPQs, as `-C2RPQs are
also monotonic and their matches are preserved under homomorphisms.

Lemma 1. For every DL-LiteR or ELH KB (T , A) and Boolean `-C2RPQ q: T , A |=
q if and only if IT ,A |= q.

    This property will be a crucial element in establishing our main theorem:

Theorem 1. Boolean `-C2RPQ entailment is NLS PACE-complete in data complexity
and PS PACE-complete in combined complexity for DL-LiteR ; the combined complexity
drops to NP-complete for DL-LiteRDFS . For ELH, the problem is P-complete in data
complexity and PS PACE-complete in combined complexity. All lower bounds hold also
for CRPQs and in the absence of role inclusions.

   We split the proof of this theorem into parts, with the lower bounds shown in the
next section, and the (more involved) proofs of the upper bounds outlined in Section 5.


4   Lower Bounds

We start by establishing the required lower bounds.
Proposition 1. Boolean CRPQ entailment is

 1. NLS PACE-hard in data complexity for DL-LiteRDFS ;
 2. P-hard in data complexity for EL;
 3. NP-hard in combined complexity for DL-LiteRDFS ;
 4. PS PACE-hard in combined complexity for DL-Lite and EL.

Proof. Statement (1) follows from the analogous result for graph databases [8]. It can
be shown by a simple reduction from the NLS PACE-complete directed reachability
problem: y is reachable from x in a directed graph G if and only if (x, y) is an an-
swer to r∗ (x, y) w.r.t. the ABox AG encoding G. Statement (2) is immediate given the
P-hardness in data complexity of CQ entailment in EL [7], and (3) follows from the
well-known NP-hardness in combined complexity of CQ entailment for databases [1].
    For statement (4), we give a reduction from the problem of emptiness of the in-
tersection of an arbitrary number of regular languages, which is known to PS PACE-
complete [10]. Consider some regular languages L1 , . . . , Ln over alphabet Σ. We will
use the symbols in Σ as role names, and we add a concept name A. Then we set
A = {A(a)} and q = ∃x L1 (a, x) ∧ . . . ∧ Ln (a, x). For DL-Lite, we will use the
following TBox: T = {A v ∃r | r ∈ Σ} ∪ {∃r− v ∃s | r, s ∈ Σ}. For EL, we can
use T = {A v ∃r.A | r ∈ Σ}. Notice that in both cases the canonical model IT ,A
consists of an infinite tree rooted at a such that every element in the interpretation has a
unique r-child for each r ∈ Σ (and no other children). Thus, we can associate to every
domain element the word over Σ given by the unique path from a, and moreover, for
every word w ∈ Σ ∗ we can find an element ew whose path from a is exactly w. This
means that if w ∈ L1 ∩ . . . ∩ Ln , we obtain a match for q in the canonical model by
mapping x to ew . Conversely, if q is entailed, then any match in the canonical model
defines a word which belongs to every Li , which means L1 ∩. . .∩Ln is non-empty. t       u


5   Upper Bounds

The main objective of this section will be to define procedure for deciding IT ,A |=
q for a given KB T , A and a given `-C2RPQ q. The procedure comprises two main
steps. First, we rewrite q into a set Q of `-C2RPQs such that IT ,A |= q if and only if
IT ,A |= q 0 for some q 0 ∈ Q. The advantage of the rewritten queries is that in order
to decide whether IT ,A |= q 0 , we will only need to consider matches which map the
variables to Ind(A). The second step evaluates the rewritten queries over the core part
of the canonical model involving only Ind(A).
Preliminary Notions In order to more easily manipulate regular languages, it will
prove convenient to use NFAs rather than regular expressions. Thus, in what follows, we
assume all binary atoms take the form α(t, t0 ), where α is an NFA over B(NR )×B(NC ).
Given α = hS, Σ, δ, s0 , F i, we use αs,G to denote the NFA hS, Σ, δ, s, Gi, i.e. the NFA
with the same states and transitions as α but with initial state s and final states G.
     A key to defining our rewriting procedure will be to understand how an atom L(t, t0 )
can be satisfied in the anonymous part of the canonical model IT ,A . A subtlety arises
from the fact that the path witnessing the satisfaction of an atom L(t, t0 ) may be quite
complicated: it may move both up and down, passing by the same element multiple
times, and possibly descending below t0 . This will lead us to decompose an atom L(t, t0 )
into multiple “smaller” atoms corresponding to segments of the L-path which are situ-
ated wholly above or below an element. Importantly, we know that the canonical model
displays a high degree of regularity, since whenever two elements p1 and p2 in the
anonymous part end with the same concept (i.e. Tail(p1 ) = Tail(p2 )), the submod-
els IT ,A |p1 and IT ,A |p2 are isomorphic. In particular, this means that if Tail(p1 ) =
Tail(p2 ), then p1 is an L-successor of itself in the interpretation IA,T |p1 just in the case
that p2 is an L-successor of itself in the interpretation IA,T |p2 .
     We now wish to define a way of testing for a given TBox T and NFA α with states
s, s0 whether Tail(e) = C ensures that there is a loop from e back to itself, situated
wholly within IT ,A |e , which takes α from state s to state s0 . To this end, we construct
a table Loopα which contains for each pair s, s0 of states in α, a subset of BC. If T is a
DL-LiteR TBox, then Loopα is defined inductively using the following rules:

(1) for every s ∈ S, Loopα [s, s] = BC
(2) if C ∈ Loopα [s1 , s2 ] and C ∈ Loopα [s2 , s3 ], then C ∈ Loopα [s1 , s3 ]
(3) if C ∈ BC, T |= C v ∃R, ∃R− ∈ Loopα [s2 , s3 ],
    (s1 , (Γ, Υ ), s2 ) ∈ δ, (s3 , (Γ 0 , Υ 0 ), s4 ) ∈ δ,
    {U ∈ NR | T |= R v U } |= Γ , {A ∈ NC | T |= ∃R− v A} |= Υ ,
    {U ∈ NR | T |= R− v U } |= Γ 0 , and {A ∈ NC | T |= C v A} |= Υ 0 ,
    then C ∈ Loopα [s1 , s4 ]
For ELH, we replace the third rule by:

(3’) if C ∈ BC, T |= C v ∃r.D, D ∈ Loopα [s2 , s3 ],
     (s1 , (Γ, Υ ), s2 ) ∈ δ, (s3 , (Γ 0 , Υ 0 ), s4 ) ∈ δ,
     {s ∈ NR | T |= r v s} |= Γ , {A ∈ NC | T |= D v A} |= Υ ,
     {s− ∈ NR | T |= r v s} |= Γ 0 , and {A ∈ NC | T |= C v A} |= Υ 0 ,
     then C ∈ Loopα [s1 , s4 ]

Note that the table Loopα can be constructed in polynomial time in |T | and |α| since
entailment of inclusions is polynomial for both DL-LiteR and ELH. The following
lemma shows that Loopα has the desired meaning:
Lemma 2. For every element p ∈ ∆IA,T \ Ind(A): Tail(p) ∈ Loopα [s, s0 ] if and only
if p is an L(αs,s0 )-successor of itself in the interpretation IA,T |p .

Query Rewriting Our aim is to rewrite our query in such a way that we do not need
to map any variables to the anonymous part of the model. We draw our inspiration
from a query rewriting procedure for Horn-SHIQ described in [9]. The main intuition
is as follows. Suppose we have a match π for q which maps some variable y to the
anonymous part, and no other variable is mapped below π(y). Then we modify q so
that it has essentially the same match except that variables mapped to π(y) are now
mapped to the (unique) parent of π(y) in IT ,A . The delicate point is that we must
“split” atoms of the form α(t, t0 ) with y ∈ {t, t0 } into the parts which are satisfied in
the subtree IT ,A |π(y) (these have already been shown to hold, so can be dropped), and
those which occur above π(y), whose satisfaction still needs to be determined and thus
must be incorporated into the new query. With each iteration of the rewriting procedure,
we obtain a query which has a match which maps variables “closer” to the core of IT ,A ,
until eventually we find some query that has a match which maps all terms to Ind(A).
    We now give a recursive non-deterministic query rewriting procedure which imple-
ments the above intuition.

P ROCEDURE rewrite(q, T )
 1. Choose either to output q or to continue.
 2. Choose a non-empty set Leaf ⊆ vars(q) and y ∈ Leaf. Rename all variables in
    Leaf to y.
 3. Choose some concept C ∈ BC such that {B | T |= C v B} |= ϕ for every atom
    ϕ(y). Drop all such atoms from q.
 4. For each atom α(t, t0 ) where α = hS, Σ, δ, s, F i is a NFA and y ∈ {t, t0 },
     – choose a sequence s1 , . . . sn of distinct states from S such that sn ∈ F ,
     – replace the atom α(t, t0 ) in q by the atoms αs,s1 (t, y), αs1 ,s2 (y, y), . . . ,
        αsn−2 ,sn−1 (y, y), αsn−1 ,sn (y, t0 ).
 5. Drop all atoms αs,s0 (y, y) such that C ∈ Loopα [s, s0 ].
 6. Choose some D ∈ BC and R ∈ NR such that:
    (a) if T is a DL-LiteR TBox, then C = ∃R− and T |= D v ∃R.
    (a’) if T is an ELH TBox, then R ∈ NR and T |= D v ∃R.C.
    (b) for each atom of the form α(y, x) with α = hS, Σ, δ, s, F i, there exist s0 ∈ S
         and (Γ, Υ ) ∈ Σ with (s, (Γ, Υ ), s0 ) ∈ δ and {U ∈ NR | T |= R− v U } |= Γ .
    (c) for each atom of the form α(x, y) with α = hS, Σ, δ, s, F i, there exists s00 ∈ S,
         sf ∈ F and some (Ω, Θ) ∈ Σ with (s00 , (Ω, Θ), sf ) ∈ δ such that {U ∈ NR |
         T |= R v U } |= Ω and {A ∈ NC | T |= C v A} |= Θ.
    Note that both (b) and (c) apply to an atom of the form α(y, y).
 7. Replace
      – each atom α(y, x) with x 6= y by atoms αs0 ,sf (y, x) and Υ (y)
      – each atom α(x, y) with x 6= y by αs,s00 (x, y)
      – each atom α(y, y) by atoms αs0 ,s00 (y, y) and Υ (y)
    with s, s0 , s00 , sf , Υ as in Step 6.
 8. If D ∈ NC is the concept chosen in Step 6, add D(y) to q. If D = ∃P − , add
    αP (z, y) to q, where z is a fresh variable and L(αP ) = {(P, true)}. Go to Step 1.
    Slightly abusing notation, we will use rewrite(q, T ) to denote the set of queries
which are output by some execution of rewrite on input q,T . We remark that the num-
ber of variables and atoms in each query in rewrite(q, T ) is linearly bounded by the
original q. This is the key property used to show the following:
Lemma 3. There are only exponentially many queries in rewrite(q, T ) (up to equiva-
lence), and each one has size polynomial in |q|.
    The next lemma shows that using rewrite(q, T ), we can reduce the problem of find-
ing an arbitrary query match to finding a match involving only ABox individuals.
Lemma 4. T , A |= q if and only if there exists a match π for some query q 0 ∈
rewrite(q, T ) in IA,T such that π(t) ∈ Ind(A) for every term t in q 0 .

Query Evaluation We have reduced T , A |= q to checking whether there is a match π
for some q 0 ∈ rewrite(q, T ) with π(t) ∈ Ind(A) for every term t in q 0 . However, even
when all terms are mapped to ABox individuals, the paths between them may need to
pass by the anonymous part in order to satisfy the regular expressions in the query. To
handle this problem, we define a relaxed notion of query entailment, which exploits
the fact that if all variables are mapped to Ind(A), only loops (that is, paths from an
individual a to itself in IA,T |a ) may participate in the paths between them. Hence, we
look for paths in the ABox that may use such loops to skip states in the query automata.
     Thus, as part of our evaluation procedure, we will need to decide for a given in-
dividual a whether a is an L(αs,s0 )-successor of itself in IA,T |a . We cannot use the
table Loopα directly, since it does not take into account the concepts which are en-
tailed due to ABox assertions. We note however that the set of loops starting from a
given individual is fully determined by the set of concepts in BC which the individ-
ual satisfies. We thus introduce a new table ALoopα such that ALoopα [s, s0 ] contains
all subsets G ⊆ BC such that a is an L(αs,s0 )-successor of itself in IA,T |a whenever
G = {C ∈ BC | a ∈ C IT ,A }. Note that the size of ALoopα is exponential in |T |, but
the associated decision problem is in P:
Lemma 5. It can be decided in polytime in |T | and |α| whether G ∈ ALoopα [s, s0 ].

Definition 2. We write T , A |≈ q if there is a mapping π from the terms in q to Ind(A)
such that:

(a) π(c) = c for each c ∈ NI ,
(b) {A ∈ NC | T , A |= A(π(t))} |= β for each atom β(t) in q, and
(c) for each α(t, t0 ) ∈ q with α = hS, Σ, δ, s, F i, there is a sequence (a0 , s0 ), . . .
    (an , sn ) of distinct pairs from Ind(A) × S such that a0 = π(t), an = π(t0 ), s0 = s,
    sn ∈ F , and for every 0 ≤ i < n, one of the following holds:
      (i) ai = ai+1 and {C ∈ BC | T , A |= C(ai )} ∈ ALoopα [si , si+1 ]
     (ii) there is some (Γ, Υ ) ∈ Σ with (si , (Γ, Υ ), si+1 ), {R ∈ NR | T , A |=
           R(ai , ai+1 )} |= Γ and {A ∈ NC | T , A |= A(ai+1 )} |= Υ .

Lemma 6. T , A |= q if and only if T , A |≈ q 0 for some q 0 ∈ rewrite(q, T ).

    Using the preceding lemma, we can derive our upper bounds:

Proposition 2. Boolean `-C2RPQ entailment is

 1. NLS PACE in data complexity for DL-LiteR and DL-LiteRDFS ;
 2. P in data complexity for ELH;
 3. NP in combined complexity for DL-LiteRDFS ;
 4. PS PACE in combined complexity for DL-LiteR and ELH.

Proof. By Lemmas 4 and 6, we can reduce T , A |= q to deciding whether T , A |≈ q 0
for some q 0 ∈ rewrite(q, T ). For items 1 and 2, if T and q are fixed, then computing
rewrite(q, T ) requires only constant time in |A|. To decide whether T , A |≈ q 0 for
q 0 ∈ rewrite(q, T ), we guess a mapping π from the terms in q 0 to Ind(A) and verify
that it satisfies the conditions in Definition 2. Note that for condition (c), we cannot keep
the whole sequence (a0 , s0 ), . . . (an , sn ) in memory at once, so we use a binary counter
that counts up to Ind(A) × |S| and store only one pair of nodes (ai , si ), (ai+1 , si+1 )
at a time. To verify conditions (b) and (c)(ii) we need checks of the form {R ∈ NR |
T , A |= R(a, b)} |= Γ and {A ∈ NC | T , A |= A(a)} |= Υ . Each one amounts
to a fixed number of instance checks (one for each symbol in Γ or Υ ), hence the data
complexity of these checks is the same as for instance checking in the corresponding
DL: in AC0 for DL-LiteR , and in P for ELH. This yields the desired upper bounds:
NLS PACE for the former, and NLS PACE P =P for the latter.
    For statement 4, instead of building the whole set rewrite(q, T ), which can be ex-
ponential, we generate a single q 0 ∈ rewrite(q, T ) non-deterministically. More pre-
cisely, we take the initial query q and apply a sequence of rewriting steps to obtain
some q 0 ∈ rewrite(q, T ). By Lemma 3, every query in rewrite(q, T ) can be generated
after at most exponentially many steps, so we can use a polynomial-sized counter to
check when we have reached this limit. Since each rewritten query is of polynomial
size (Lemma 3), and we keep a single query in memory at a time, the generation of
a single query in rewrite(q, T ) requires only polynomial space. Then we can use the
same strategy as above to decide in polynomial space whether T , A |≈ q 0 . We thus
have a non-deterministic polynomial space procedure for deciding T , A |= q. Using
the well-known fact that NPS PACE =PS PACE, we obtain the desired upper bound.
    For statement 3, we note that if T is an DL-LiteRDFS TBox, then the query cannot
be rewritten, i.e. rewrite(q, T ) = {q}. Thus, it suffices to decide whether T , A |≈ q. We
then remark that the procedure described above is in NP, since we guess a (polysize)
mapping π and verify in polytime that π satisfies the conditions of Definition 2.         t
                                                                                          u


6    Future Work
In future work, we plan to study what types of restrictions on the TBox and query
lead to better combined complexity. We also wish to explore other types of path-based
query languages. One interesting extension which has been recently proposed for graph
databases [3] is the addition of path variables. In Boolean queries, these prove useful
for speaking about equality of paths. For example, one might want to find whether there
is a chain of advisors of the same length between both Edmund Clarke and Bernoulli,
and Jack Minker and Bernoulli. Things become even more interesting if we allow path
variables in the output. By outputting the paths for the preceding query, we could dis-
cover that Clarke and Minker have a path to Bernoulli of length 12. We could even
output the common areas of expertise of the scientists along the path to find out that
both have an 8-step path to some physicist (Poisson and Fourier), that in turn reach
Bernoulli via an important figure in analysis (Lagrange) and a graph theorist (Euler).


References
 1. Abiteboul, S., Hull, R., Vianu, V.: Foundations of Databases. Addison-Wesley (1995)
 2. Baader, F., Brandt, S., Lutz, C.: Pushing the EL envelope. In: Proc. of IJCAI. pp. 364–369
    (2005)
 3. Barceló, P., Hurtado, C.A., Libkin, L., Wood, P.T.: Expressive languages for path queries
    over graph-structured data. In: Proc. of PODS. pp. 3–14 (2010)
 4. Calvanese, D., De Giacomo, G., Lembo, D., Lenzerini, M., Rosati, R.: Tractable reasoning
    and efficient query answering in description logics: The DL-Lite family. Journal of Auto-
    mated Reasoning 39(3), 385–429 (2007)
 5. Calvanese, D., Eiter, T., Ortiz, M.: Answering regular path queries in expressive description
    logics: An automata-theoretic approach. In: Proc. of AAAI. pp. 391–396 (2007)
 6. Calvanese, D., Eiter, T., Ortiz, M.: Regular path queries in expressive description logics with
    nominals. In: Proc. of IJCAI. pp. 714–720 (2009)
 7. Calvanese, D., Giacomo, G.D., Lembo, D., Lenzerini, M., Rosati, R.: Data complexity of
    query answering in description logics. In: Proc. of KR. pp. 260–270 (2006)
 8. Consens, M.P., Mendelzon, A.O.: Graphlog: a visual formalism for real life recursion. In:
    Proc. of PODS. pp. 404–416 (1990)
 9. Eiter, T., Ortiz, M., Šimkus, M., Tran, T., Xiao, G.: Query rewriting for Horn-SHIQ plus
    rules. In: Proc. of AAAI (2012)
10. Kozen, D.: Lower bounds for natural proof systems. In: Proc. of FOCS. pp. 254–266 (1977)
11. Ortiz, M., Rudolph, S., Simkus, M.: Query answering in the horn fragments of the description
    logics SHOIQ and SROIQ. In: Proc. of IJCAI. pp. 1039–1044 (2011)