=Paper= {{Paper |id=None |storemode=property |title=The Combined Approach to OBDA: Taming Role Hierarchies using Filters |pdfUrl=https://ceur-ws.org/Vol-943/SSWS_HPCSW2012_paper2.pdf |volume=Vol-943 |dblpUrl=https://dblp.org/rec/conf/semweb/LutzSTW12 }} ==The Combined Approach to OBDA: Taming Role Hierarchies using Filters== https://ceur-ws.org/Vol-943/SSWS_HPCSW2012_paper2.pdf
    The Combined Approach to OBDA: Taming
          Role Hierarchies using Filters

       Carsten Lutz1 , İnanç Seylan1 , David Toman2 , and Frank Wolter3
                            1
                              Universität Bremen, Germany
                     {clu,seylan}@informatik.uni-bremen.de
              2
                Cheriton School of CS, University of Waterloo, Canada
                               david@cs.uwaterloo.ca
                      3
                        University of Liverpool, United Kingdom
                               wolter@liverpool.ac.uk



      Abstract. There are several approaches to implementing query answer-
      ing over instance data in the presence of an ontology that target conven-
      tional relational database systems (SQL databases) as their back-end.
      They all share the limitation that, for ontologies formulated in OWL2
      QL or versions of the description logic DL-Lite that admit both role hi-
      erarchies and inverse roles, it seems impossible to avoid an exponential
      blowup of the query (and sometimes this is even provable). We consider
      the combined approach and propose to replace its query rewriting part
      with a filtering technique. This is natural from an implementation per-
      spective and allows us to handle both inverse roles and role hierarchies
      without an exponential blowup. We also carry out an experimental eval-
      uation that demonstrates the scalability of this approach.


1   Introduction

In recent years, ontology-based data access (OBDA) has emerged as a promising
and challenging application of ontologies. The idea is to enrich instance data with
a ‘semantic layer’ in the form of an ontology, used as an interface for querying
and to derive additional answers. A central research problem in this area is to
design query answering engines that can deal with sufficiently expressive ontology
languages yet scale to large data sets. The most popular ontology languages that
have been considered for OBDA include the three profiles OWL2 RL, OWL2 QL,
and OWL2 EL, as well as various description logics and datalog variants related
to these profiles [2, 3, 5, 11, 15].
    Currently, there are two major methodologies for answering queries in an
OBDA setting: rewriting-based approaches (also called backward chaining) and
materialization-based approaches (also called forward chaining). In the former,
one compiles the ontology T and the query q into a new query qT that contains
the relevant knowledge from the ontology, i.e., the answers to q over A and T
coincide with the answers to qT over A. One can thus store A in a relational
database management system RDBMS and execute qT over A. In materialization




                                        16
2       The Combined Approach to OBDA: Taming Role Hierarchies using Filters

approaches, the instance data A is completed with the relevant knowledge from
the ontology T , i.e. for any query q, the answers given to q over A and T coincide
with the answers given to q over the completed instance data AT ⊇ A without
any ontology. Thus, one can store AT in a relational database management
system (RDBMS) and execute q over AT .
    A technical problem that arises in materialization approaches is that the
completed data AT easily becomes infinite; in particular, this may happen when
the ontology expresses cyclic dependencies and has existential quantifiers in the
heads of its concept inclusions, which is allowed in most ontology languages in-
cluding the ones mentioned above. To overcome this problem, an economic way
of reusing individuals introduced for existential quantifiers has been proposed
in [8, 9] for the case where ontologies are formulated in description logics from
the EL and DL-Lite families, which are the logical cores of the OWL2 EL and
OWL2 QL ontology languages. While the resulting completed data sets are finite
and give correct answers to instance queries, they can give spurious answers to
conjunctive queries (CQs). To recover soundness for CQs, it is thus necessary to
include an additional step, resulting in the combined approach to query answer-
ing: the original query is rewritten in a way that eliminates spurious answers.
In sharp contrast to pure rewriting, the auxiliary query rewriting required in
the combined approach turns out to be rather simple—an additional selection
condition applied to the results of the original CQ over the completed data—
and often of polynomial size. Indeed, experiments indicate that the combined
approach admits very efficient query execution for expressive variants of EL and
DL-Lite [8, 9].
    Unfortunately, there are certain combinations of logical operators that are
important from an application perspective, but for which an exponential blowup
of the query seems to be unavoidable both in the query rewriting approach and
in the combined approach. In particular, this is the case for the combination
of inverse roles and role hierarchies as found in DL-LiteR [3], the extension of
basic DL-Lite with role hierarchies that underpins OWL2 QL. It has been shown
that, in the query rewriting approach, an exponential blowup of the query size is
unavoidable when the ontology is formulated in DL-LiteR [7]. For the combined
approach, an auxiliary query rewriting strategy for DL-LiteR ontologies and CQs
is presented in [8], but it incurs an exponential blowup and it seems unlikely that
the rewriting can be improved to a poly-sized one (although this question is yet
to be resolved).
    In this paper, we present a new variation on the combined approach that can
handle CQs and DL-LiteR ontologies and eliminates the need for auxiliary query
rewriting altogether, thus also eliminating the need to deal with exponentially
sized queries. Specifically, we replace auxiliary query rewriting with a filtering
component: spurious answers are eliminated by a polynomial-time filtering pro-
cedure (called a filter in the rest of the paper) that is installed as a user-defined
function in the underlying RDBMS. Our main contributions are as follows.
(1) We develop a polynomial time procedure for filtering out spurious answers to
CQs for ontologies formulated in DL-LiteR . Interestingly, the existence of such




                                        17
                  Carsten Lutz, İnanç Seylan, David Toman, and Frank Wolter     3

a filtering procedure appears to be quite sensitive to how exactly the instance
data is completed. Compared to the data completion for the original combined
approach [8], the filtering technique requires subtle modifications to the data
completion in order to obtain a polytime filter.
(2) To analyze the performance of our approach and to compare it with the query
rewriting approach, we modify the Lehigh University Benchmark (LUBM) [6] by
introducing additional existential restrictions and subconcepts into the LUBM
ontology and incompleteness into the LUBM Data Generator. Additional queries
that are designed to stress-test the performance of the filtering procedure are
introduced as well.
(3) Finally, the resulting ontologies, data, and queries are used to demonstrate
the feasibility of our approach, and to show that it scales to large amounts of
instance data.
Some technical proofs and details of our experimental evaluation are presented in
the appendix of the full version of this paper, available at http://informatik.
uni-bremen.de/~clu/combined/


2     Preliminaries

We introduce DL-LiteR -TBoxes, ABoxes, and conjunctive queries. Let NI , NC ,
and NR be countably infinite sets of individual names, concept names and role
names. Roles R, simple concepts C, and concepts D are built according to the
following syntax rules, where P ranges over NR and A over NC :
        R ::= P | P − ,          C ::= A | ∃R,           D ::= C | ¬C | ∃R.A.
As usual, we use N−                                                − −
                  R to denote the set of all roles and identify (P )   with P .
In DL-LiteR , a TBox is a finite set T of concept inclusions (CIs) C � D with
C a simple concept and D a concept, and role inclusions (RIs) R1 � R2 with
R1 , R2 roles.1
   An ABox is a finite set of concept assertions A(a) and role assertions P (a, b),
where A ∈ NC , P ∈ NR and a, b ∈ NI . We denote by Ind(A) the set of individual
names that occur in A, and write P − (a, b) ∈ A instead of P (b, a) ∈ A whenever
convenient. A knowledge base (KB) is a pair (T , A) with T a TBox and A an
ABox.
    The semantics of TBoxes and ABoxes is defined in the standard way based
on interpretations I = (∆I , ·I ), where ∆I is a non-empty domain and ·I an
interpretation function that maps each A ∈ NC to a subset AI ⊆ ∆I , each
P ∈ NR to a relation P I ⊆ ∆I × ∆I , and each a ∈ NI to an element aI ∈ ∆I ; for
details consult [1, 3]. An interpretation is a model of a TBox T if it satisfies all
inclusions in T ; models of ABoxes and knowledge bases are defined analogously.
A knowledge base is consistent if it has a model. For a CI or RI α, we write
T |= α when α is a consequence of T (satisfied in all models of T ). Instead of
1
    A set of role inclusions is also called a role hierarchy.




                                             18
4       The Combined Approach to OBDA: Taming Role Hierarchies using Filters

T |= R � S, we will usually write R �∗T S to clearly distinguish consequences of
this form (which are RIs) from consequences of the form T |= ∃R � ∃S (which
are CIs). Note that, in DL-LiteR , deciding consistency and logical consequence
amounts to computing a form of transitive closure [3].
    Let NV be a countably infinite set of variables. Taken together, the sets NV
and NI form the set NT of terms. A first-order (FO) query is a first-order formula
q = ϕ(x) in the signature NC ∪ NR and with terms from NT , where the concept
and role names are treated as unary and binary predicates, respectively, and x
are the free variables of ϕ called the answer variables; we say that q is k-ary if
x comprises k variables. If k = 0, then q is a Boolean query. A conjunctive query
(CQ) is an FO query of the form q = ∃y ψ(y, x), where ψ is a conjunction of
concept atoms A(t) and role atoms P (t, t� ) where t, t� ∈ NT . As in the case of
ABox assertions, we do not distinguish between P − (t, t� ) and P (t� , t). We denote
by term(q) the set of terms in q.
    Let q = ϕ(x) be a k-ary FO query with x = x1 , . . . , xk , and I an interpre-
tation. A mapping π : term(q) → ∆I with π(a) = aI for all a ∈ term(q) ∩ NI is
a match for q in I if I satisfies q under the variable assignment that maps each
t ∈ term(q) to π(t); in this case, we write I |=π q. For a k-tuple of individual
names a = a1 , . . . , ak , a match π for q in I is an a-match if π(xi ) = aIi for
i ≤ k. We say that a is an answer to q in an interpretation I if there is an
a-match for q in I and use ans(q, I) to denote the set of all answers to q in I.
Finally, a is a certain answer to q over a KB K = (T , A) if a ⊆ Ind(A) and
I |= q[a] for all models I of K. The set of all certain answers to q over K is
denoted by cert(q, K). The query answering problem considered in this paper is:
given a DL-LiteR knowledge base K and a CQ q, compute cert(q, K).
    To simplify notation, throughout the paper we adopt the unique name as-
sumption (UNA), i.e., require that aI �= bI for distinct a, b ∈ NI . This assumption
has no impact on the query answering problem.


3   ABox Completion
As explained in the introduction, the central idea of the combined approach is to
materialize consequences of the TBox in the ABox as a preprocessing step, and
then to execute queries over the completed data stored in an RDBMS as a plain
table. We illustrate this using two examples from the university domain, similar
in spirit to the LUBM ontology from [6] used in the experimental evaluation.
Example 1. For any ABox A, the concept inclusions
                              Student � Person                                   (1)
                              Student � ∃takesCourse                             (2)
lead to the following additions: (1) for every assertion Student(a) ∈ A, add
(1) Person(a) and (2) takesCourse(a, b) for some fresh individual b (unless such
assertions are already present). After this completion, a CQ such as
                     q1 (x) = ∃y Person(x) ∧ takesCourse(x, y)




                                        19
               Carsten Lutz, İnanç Seylan, David Toman, and Frank Wolter       5

                                           Faculty
                                             a

                                                degreeFrom

                                               b Univ
                            degreeFrom               deptOf

                                   c                 d
                               Faculty    teachesAt Dept


                     Fig. 1. Completed ABox for Example 3.


correctly returns each a with Student(a) ∈ A as a certain answer.
The following example shows that naive completion can result in infinite ABoxes.
Example 2. Completed naively, the ABox {Faculty(a)} and LUBM inclusions
              Faculty � ∃degreeFrom            ∃degreeFrom− � Univ             (3)
                                  −
                 Univ � ∃deptOf                      ∃deptOf � Dept            (4)
                                      −
                 Dept � ∃teachesAt               ∃teachesAt � Faculty          (5)
result in an infinite role chain that indefinitely repeats the roles degreeFrom,
deptOf − , and teachesAt− .
The problem can be overcome by reusing fresh individuals in an economic way.
Example 3. Consider again the TBox (3)-(5). By reusing individuals, the ABox
{Faculty(a)} can be completed as shown in Figure 1, replacing the infinite role
chain with a cycle. Individual reuse compromises soundness of query answering
as some queries now have spurious answers; for example, the CQ
            q2 (x) = ∃y, z Faculty(x) ∧ degreeFrom(x, y) ∧ Univ(y) ∧
                          deptOf(z, y) ∧ Dept(z) ∧ teachesAt(x, z)
returns c as an answer when executed over the completed ABox shown in Fig-
ure 1. This answer is spurious for two reasons: first, the cycle in Figure 1 is
present only due to individual reuse and thus should be disregarded for query
matches; and second, the freshly introduced individuals b, c, d are ‘labeled nulls’
and thus can never be returned as answers.
To recover soundness, it is necessary to eliminate the spurious answers. In the
original combined approach, this was achieved by query rewriting [8, 9]. In this
paper, the spurious answers are eliminated by a filtering procedure that is in-
stalled as a user-defined function in the RDBMS. In the remainder of this sec-
tion, we introduce ABox completion in full detail. In the subsequent section, we
describe the filtering procedure.
    From a conceptual perspective, the ABox completion step can be viewed
as replacing the original ABox with the canonical model IK of the knowledge




                                          20
6         The Combined Approach to OBDA: Taming Role Hierarchies using Filters

base K [8]. To define IK , we need a few preliminaries. From now on, we will
generally disallow concepts of the form ∃R.C. This can be done without loss of
generality since each CI D � ∃R.C can be replaced with D � ∃RC , RC � R,
        −
and ∃RC   � C, where RC is a fresh role name.
    Let K = (T , A) be a DL-LiteR KB. We use rol(T ) to denote the set of all
role names in T plus their inverses. The canonical model comprises at most two
fresh individuals for every role in rol(T ). However, we only want to introduce
the fresh individuals for a given role when necessary. Formally, we call a role
R ∈ rol(T ) generating in T if there exist an a ∈ Ind(A) and R0 , . . . , Rn ∈ rol(T )
such that Rn = R and the following conditions hold:

                                  / A for all b ∈ Ind(A) (written a � ∃R0 ),
(agen) K |= ∃R0 (a) and R0 (a, b) ∈
(rgen) for i < n, T |= ∃Ri− � ∃Ri+1 and Ri− �= Ri+1 (written ∃Ri− � ∃Ri+1 ).

To facilitate the implementation of efficient filters, we refine the definition of
canonical models as given in [8]: in some cases, we introduce two fresh individ-
uals for a given role instead of only a single one. The need for the additional
individual is related to particular role configurations in the TBox called a loop:
a set {R, S} ⊆ rol(T ) (where potentially R = S) is a loop in T if R �= S − ,
T |= ∃R− � ∃S, T |= ∃S − � ∃R, and there is some T ∈ rol(T ) such that
S − �∗T T and R �∗T T . Let LT denote the set of all roles that occur in a loop
in T . The canonical model IK is then based on the domain
            ∆IK = Ind(A) ∪ {cR,0 | R ∈ rol(T ) \ LT is generating in K}
                         ∪ {cR,0 , cR,1 | R ∈ LT is generating in K}.
To define the extension of roles in IK , we need some additional preparation. Let
“≺” be an arbitrary, but fixed total ordering on rol(T ). For all d, d� ∈ ∆IK and
each role R, we write d �R d� whenever there is an S such that S �∗T R and
one of the following cases applies:

    – d = a ∈ Ind(A), a � ∃S, and d� = cS,0 ;
    – d = cT,i , ∃T − � ∃S, d� = cS,j , and one of the following holds
        • i = j and {S, T } is not a loop in T ;
        • i = j, {S, T } is a loop in T , and S ≺ T ;
        • i = j, {S, T } is a loop in T , and T = S or T ≺ S (for 0 = 1 and 1 = 0).

The canonical model IK for K is now defined as follows, based on the domain
∆IK introduced above:
        AIK = {a ∈ Ind(A) | K |= A(a)} ∪ {cR,i ∈ ∆IK | T |= ∃R− � A},
        RIK = {(a, b) ∈ Ind(A) × Ind(A) | ∃S : S(a, b) ∈ A and S �∗T R} ∪
                 {(d, d� ) ∈ ∆IK | d �R d� or d� �R− d},
        aIK = a.
The ABox completion consists of replacing the ABox A originally stored in the
RDBMS with its canonical model IK . This can be achieved by executing a set
of FO/SQL-queries whose size is polynomial in the size of T [8].




                                         21
                Carsten Lutz, İnanç Seylan, David Toman, and Frank Wolter                      7


                                       a                                   a
                                                                       w       i
                           i
                               w                                           acw,0
                                   p                                   p
                    cw,0                           cp,0                      i
                                   i w
                                                                           acw,0 cp,0
                                               i
                                           w                           w     i
                                   p i                                      ac c c
                    cw,1                           cp,1                    .. w,0 p,0 w,1
                                   i                                        .

   Fig. 2. Canonical model IK and unraveled canonical model UK for Example 5.


    It can be shown that IK is a model of K whenever K is consistent. Note that
one can find a Boolean CQ qT of size polynomial in the size of T such that for
any ABox A stored in the RDBMS, qT gives a positive answer iff K = (T , A)
is consistent [8]. We can thus safely assume that the knowledge base has been
tested for consistency before query answering.

Example 4. Reconsider Examples 2 and 3. The canonical model IK for the ABox
{Faculty(a)} and TBox (3)-(5) is the structure displayed in Figure 1. Follow-
ing our construction above, the fresh individuals b, c, d are named cdegreeFrom,0 ,
cteachesAt− ,0 , and cdeptOf − ,0 . Note that the TBox (3)-(5) does not give rise to any
loops, and thus all cR,i have index i = 0.

Example 5. The following TBox gives rise to the loop {worksFor, paysSalaryOf}:

        Employee � ∃worksFor                                    ∃worksFor− � Employer           (6)
                                                                            −
        Employer � ∃paysSalaryOf                          ∃paysSalaryOf            � Employee   (7)
                −
       worksFor � isAffiliatedWith                               paysSalaryOf � isAffiliatedWith.   (8)

The canonical model IK for the ABox {Employee(a)} and the TBox (6)-(8) with
paysSalaryOf ≺ worksFor is shown on the left-hand side of Figure 2, where concept
names are omitted and role names are abbreviated by their first letter.

Note that a more straightforward version of the canonical model could be ob-
tained by identifying all elements cR,0 and cR,1 . Section 4 explains why this leads
to problems for efficient filtering. Also note that real world ontologies seem to
contain only very few loops, if any, and thus having two domain elements per role
that participates in a loop should not significantly increase the size of canonical
models in practical cases.
    To characterize the spurious answers that have to be filtered out, it is useful
to introduce an unraveled (infinite) version of canonical models. Let K be a
knowledge base. A path is a finite sequence ad1 · · · dn , n ≥ 0, such that a ∈
Ind(A), d1 , . . . , dn ∈ ∆IK \ Ind(A), a �R d1 for some R ∈ N−     R , and di �R di+1
for some R ∈ N−     R , 1 ≤ i < n. We denote by tail(σ) the last element  of the path σ.




                                                          22
8         The Combined Approach to OBDA: Taming Role Hierarchies using Filters

The unraveled canonical model UK is then defined by taking:
                          ∆UK is the set of all paths in IK ,
                          aUK = a, for all a ∈ Ind(A),
                          AUK = {σ ∈ ∆UK | tail(σ) ∈ AIK },
         RUK = {(a, b) ∈ Ind(A) × Ind(A) | ∃S : S(a, b) ∈ A and S �∗T R} ∪
               {(σ, σd) | σd ∈ ∆UK and tail(σ) �R d} ∪
               {(σd, σ) | σd ∈ ∆UK and tail(σ) �R− d}.
As an example, the canonical model UK for the KB from Example 4 is shown on
the right-hand side of Figure 2. The following result shows that, as one would
expect, UK does not suffer from spurious answers.
Theorem 1. For every consistent DL-LiteR -KB K and every CQ q, we have
cert(q, K) = ans(q, UK ).
The proof of Theorem 1 is standard and omitted, see [8] for a similar proof.


4      Filtering
To remove spurious answers, we install a filtering procedure as a user-defined
function in the RDBMS. The procedure takes as input a match of the query in
the canonical model IK stored in the RDBMS and returns “false” if this match
is spurious and “true” otherwise. We assume that the filtering procedure has
access to the query and the TBox, but not to the data. To define its behavior
more precisely, we formally define spurious matches based on unraveled canonical
models UK and Theorem 1.
    Let K be a KB and q(x) a CQ. A match π of q in IK is reproduced by a
match τ of q in UK if for all t ∈ term(q), we have π(t) = tail(τ (t)). We say that
π is spurious if it is not reproduced by any match τ of q in UK . The following
lemma, which is an immediate consequence of Theorem 1, shows that IK can be
used for query answering when spurious matches are filtered out.
Lemma 1. a ∈ cert(q, K) iff there is an a-match π of q in IK that is not
spurious.
We want to show that it can be decided in time polynomial in the size of q and
T whether a given match in IK is spurious. Clearly, it is enough to test for each
maximally connected component of q whether the given match is spurious on
that component. We can thus w.l.o.g. assume that q is connected.
     We need a few preliminaries. An anonymous path is a path without the
leading individual name, i.e., it is a finite sequence d1 · · · dn , n ≥ 1, such that
d1 , . . . , dn ∈ ∆IK \ Ind(A) and di �R di+1 for some R ∈ N−   R , 1 ≤ i < n. We use
Paths to denote the set of all paths, both anonymous and non-anonymous. A root
configuration for q given π is a set ρ ⊆ term(q) such that one of the following
conditions is true:
    – ρ is the set of those t ∈ term(q) such that π(t) ∈ NI and this set is non-empty;




                                          23
                 Carsten Lutz, İnanç Seylan, David Toman, and Frank Wolter                9

                           Student Student                    Student
                                a 1 a2              ···          an

                            takesCourse                    takesCourse
                                                 ctakesCourse,0

                       Fig. 3. Canonical model IK for Example 6.


 – the above set is empty and ρ contains exactly one term (actually a variable).
The filtering procedure first checks whether all answer variables are mapped to
elements of Ind(A), i.e., individuals of the original ABox A. If this is not the
case, it immediately returns “false”. Then the procedure iterates through all
root configurations ρ. For each ρ, it constructs a sequence Sρ0 , Sρ1 , . . . of relations
Sρi ⊆ term(q) × Paths as follows:
 – Sρ0 contains all pairs (t, π(t)) with t ∈ ρ;
 – Sρi+1 is Sρi extended with the following pairs:
   (a) (t, σπ(t)) for all R(s, t) ∈ q with (s, σ) ∈ Sρi and π(s) �R π(t);
   (b) (t, σπ(t)) for all R(s, t) ∈ q with (s, σπ(t)π(s)) ∈ Sρi and π(t) �R− π(s).
The computation stops as soon as the sequence stabilizes or Sρi becomes non-
functional which happens after at most |term(q)| iterations. The procedure re-
turns “true” if the final Sρi is a function with domain term(q) for some root
configuration ρ, and “false” otherwise.
Example 6. Consider the TBox (1)-(2) from Example 1, the query
 q3 (x, y) = ∃z Student(x) ∧ Student(y) ∧ takesCourse(x, z) ∧ takesCourse(y, z), (9)
and the ABox
                             {Student(a1 ), . . . , Student(an )}.                        (10)
The canonical model IK is shown in Figure 3. Suppose the filter gets as input
the match π = {x �→ a1 , y �→ a2 , z �→ ctakesCourse,0 }. There is only one possible
root configuration for π, which is ρ = {x, y}. The procedure computes
            Sρ = {(x, a1 ), (y, a2 ), (z, a1 ctakesCourse,0 ), (z, a2 ctakesCourse,0 )}
which is not a function; thus, the match is identified as spurious and “false” is
returned.
Example 7. Consider the ABox {Faculty(a)}, TBox (3)-(5), and query q2 from
Example 3. To make things a bit more interesting, assume that x is a quantified
variable in q2 rather than an answer variable. Recall that the canonical model IK
is shown in Figure 1, modulo the names of fresh individuals. Given the match
π = {x �→ c, y �→ b, z �→ d} and considering the root configuration ρ = {x}, the
procedure computes
                         Sρ = {(x, c), (y, cb), (z, cbd), (x, cbdc)}
and stops because of non-functionality. For the other root configurations ρ = {y}
and ρ = {z}, the procedure fails in a similar way and thus returns “false”.




                                               24
10       The Combined Approach to OBDA: Taming Role Hierarchies using Filters

Similar to the “tree witnesses” from [8], the filtering procedure follows a simple
idea for reproducing the input match π in IK as a match τ in UK : when we have
already decided that τ (x) = σ ∈/ Ind(A) and R(x, y) ∈ q, then there is a uniquely
determined individual σ � to which y can be matched. This follows from requiring
π(y) = tail(τ (y)) and the following property of UK :
     if (σ, σ � ) ∈ RUK and (σ, σ �� ) ∈ RUK with σ � �= σ �� , then tail(σ � ) �= tail(σ �� ).
In fact, it is this determinism of matches that is made explicit by Conditions (a)
and (b) of the filtering procedure. Note that, without introducing two individual
names cR,0 and cR,1 whenever R is involved in a loop, the above crucial property
of UK fails. In fact, we do not know whether polytime filtering is possible based
on the variation of the canonical model where all individuals cR,0 and cR,1 are
identified. The problem is illustrated by the following example.
Example 8. Consider the ABox {Employee(a)} and TBox (6)-(8) from Exam-
ple 5 and the CQ
                        q4 (x) = ∃y, z, u w(x, y) ∧ p(y, z) ∧ i(u, z).
Let π = {x �→ a, y �→ cw,0 , z �→ cp,0 , u �→ cw,1 }. The only root configuration is
ρ = {x}. During the first two iterations, the filtering procedure produces
                          Sρ2 = {(x, a), (y, acw,0 ), (z, acw,0 cp,0 )}.
Sρ2 says that z has to be mapped to acw,0 cp,0 . Due to the atom i(z, u) ∈ q4 and
the two i-edges outgoing from z in UK , the possible targets for u are acw,0 and
acw,1 . However, to produce a match in UK that is compatible with π, we can only
choose a target that ends with π(u) = cw,1 and obtain
                Sρ3 = {(x, a), (y, acw,0 ), (z, acw,0 cp,0 ), (z, acw,0 cp,0 cw,1 )}
which is functional, showing that the match π is not spurious. In a canonical
model IK where cw,0 and acw,1 are identified, there are indeed two choices for
mapping of u. This makes it non-obvious how to find a polytime filtering proce-
dure in this case, if one exists at all.
We now analyze the runtime and correctness of the filtering procedure. First
note that, in Conditions (a) and (b), the filtering procedure has to check whether
π(s) �R π(t) and π(t) �R− π(s), respectively. As required, both conditions can
be tested without access to the ABox A. For example, in Condition (a) we have:
 – if π(t) ∈ Ind(A), then π(s) �R π(t) does not hold and checking whether
   π(t) ∈ Ind(A) requires only to check whether or not π(t) is of the form cR,i ;
 – if π(s) ∈ Ind(A) and π(t) ∈ / Ind(A), then π(s) �R π(t) holds by the con-
   struction of IK since π is a match of q in IK , and;
 – if π(s) ∈
           / Ind(A) and π(t) ∈ / Ind(A), then π(s) �R π(t) can be checked by
   using only π and T based on the definition of “�R ”.
It is not hard to see that the algorithm runs in polynomial time. The runtime
is quadratic in the size of q because we first have to iterate over all root con-
figurations ρ and then need to compute Sρ , essentially a breadth-first search




                                                25
               Carsten Lutz, İnanç Seylan, David Toman, and Frank Wolter        11

of (the graph of) q. We conjecture that iterating over all root configurations
is avoidable at the cost of a less transparent filtering procedure, improving the
runtime to linear in the size of q. The runtime also depends on T as checking
the applicability of Conditions (a) and (b) involves testing consequences of the
forms T |= ∃R � ∃S and S �∗T R. Since it is efficient to pre-compute all these
consequences in practical cases, this amounts to a simple lookup.
   The following lemma asserts correctness of the filtering procedure. It is proved
in Appendix A of the full version.
Lemma 2. Given a match π of q in IK , the filtering procedure returns “true”
iff π is not spurious.


5     Experiments

We carry out an experimental evaluation to analyze the performance of the com-
bined approach with filtering, based on the IBM DB2 relational database system.
We use a modified version of the ontology from the Lehigh University Benchmark
(LUBM) [6] and produce test ABoxes using a modified version of the LUBM data
generator. We execute five queries from the LUBM suite as well as six hand-
crafted ones that were designed to test the proposed approach more fully. Note
that LUBM was not specifically designed for evaluating OBDA with DL-LiteR
and in its original form is not too useful for this purpose, for reasons explained in
more detail below. Our modifications of the LUBM ontology, data generator, and
query set all aim at making the LUBM suite more realistic for OBDA evaluation.
We believe that this material might be interesting also for future experiments
and provide it online at http://informatik.uni-bremen.de/~clu/combined/.


5.1   Ontology, Data, and Queries

The LUBM ontology comprises 42 concept names and 25 role names and is
formulated in the description logic ELI extended with transitive roles, role hier-
archies, and datatypes. The TBox contains concept inclusions of the form A � C,
concept definitions A ≡ C as abbreviations for A � C, C � A, and domain and
range restrictions of the form ∃R � A and ∃R− � A. We converted this on-
tology to DL-LiteR by dropping all datatypes, treating the only transitive role
subOrganizationOf as a standard role, replacing concept equations A ≡ C with
A � C, and breaking up conjunctions A � C1 � C2 into A � C1 , A � C2 .
    While the resulting TBox is formulated in DL-LiteR as required, it is only
moderately interesting for evaluating query answering techniques: first, there is
a lack of existential restrictions ∃R and ∃R.C on the right-hand side of con-
cept inclusions, which leads to extremely few fresh anonymous individuals being
generated during the ABox completion, and consequently to very few role edges
between those individuals (from now on, we call this part of the canonical model
IK the anonymous part); second, the overall size of the TBox is too small to be
representative for real-world ontologies. To attenuate these deficiencies while still




                                        26
12     The Combined Approach to OBDA: Taming Role Hierarchies using Filters

being able to use the LUBM data generator, we extended the DL-LiteR -version
of LUBM in two directions:
(1) We added 26 concept inclusions, many of which have existential restrictions
on the right-hand side, to generate a more interesting anonymous part of canon-
ical models. A complete list of these CIs can be found in Appendix B of the full
version of this paper.
(2) With reasonable effort, it does not seem possible to significantly increase the
size of LUBM (to hundreds or thousands of concepts) while retaining a careful
modeling. One particularly unrealistic aspect of LUBM and a striking differ-
ence to more comprehensive ontologies is its limited concept hierarchy, where
each concept has only very few subconcepts. To alleviate this shortcoming, we
added subconcepts to each of the LUBM concepts Course, Department, Professor,
and Student by introducing subject areas, such as MathCourse, BioCourse, and
CSCourse for courses, MathProfessor, BioProfessor for professors, etc.
    We call the resulting TBox LUBM∃n with n indicating the number of sub-
concepts introduced in Point 2 above (20 by default). For example, LUBM∃20
contains 106 concept names and 27 role names.
    To generate ABoxes, we use the LUBM Data Generator (UBA) version 1.7,
modified so as to complement our modifications to the TBox. Specifically, the
original UBA generates data that is complete w.r.t. existential restrictions in the
LUBM ontology: it produces ABoxes A such that for every assertion A(a) ∈ A
and CI A � ∃R (and A � ∃R.B) in LUBM∃n , there is already an r-successor of
a in A. Our modifications introduce a controlled amount of incompleteness: the
modified data generator takes a probability p as a parameter and, in selected
parts of the data, drops generated role assertions with probability p. More infor-
mation can be found in Appendix C of the full version. The second modification
of the data generator is linked to the subconcepts introduced in Point 2 above.
Whenever the original generator produces an instance a of Student, the new
generator randomly chooses a value between 1 and n and generates an asser-
tion for the i-th subject, SubjiStudent(a); similarly for Course, Department, and
Professor.
    The main aim of our experiments is to show that our approach is feasible on
realistic ontologies, data, and queries. Additionally, we also provide a prelimi-
nary comparison with the query rewriting approach, using the Requiem tool for
producing those rewritings [10]. We use 11 queries, six of which we have hand-
crafted specifically for our experiments and five originating from the evaluation
of Requiem presented in [10]. The latter queries are extremely simple and do
in most cases neither pose a serious challenge for the filtering approach nor for
pure rewriting. The former are shown in Figure 4. Note that q3 is very similar
to the query discussed in Examples 3, 4, and 7; and is designed in such a way
that spurious cycles in the anonymous part of canonical models produce spurious
matches that have to be filtered out. Query q2 is essentially the query discussed
in Example 6 and is designed to stress-test the filtering approach: based on the
data generation scheme, it is expected to produce a very large number of spurious
answers.




                                        27
                Carsten Lutz, İnanç Seylan, David Toman, and Frank Wolter      13

q1(x,y) <- Student(x), takesCourse(x,z), Course(z), teacherOf(y,z),
            Faculty(y), worksFor(y,u), Department(u), memberOf(x,u)
q2(x,y) <- Subj3Student(x), Subj4Student(y),
           takesCourse(x,z), takesCourse(y,z)
q3(x)   <- Faculty(x), degreeFrom(x,y), University(y),
           subOrganizationOf(z,y), Department(z), memberOf(x,z)
q4(x,y) <- Subj3Department(x), Subj4Department(y),
           Professor(z), memberOf(z,x), publicationAuthor(u,z),
           Professor(v), memberOf(v,y), publicationAuthor(u,v)
q5(x)   <- Publication(x), publicationAuthor(x,y), Professor(y),
           publicationAuthor(x,z), Student(z)
q6(x,y) <- University(x), University(y), memberOf(z,x), Student(z),
           memberOf(u,y), Professor(u), advisor(z,u)

                             Fig. 4. Queries q1 to q6 .


        Universities CA CA (compl) RA RA (compl) Inds Inds (compl)
            10      373K  636K     593K 1.3M     201K    201K
            25      984K   1.6M    1.5M 3.6M     528K    528K
            50       1.9M  3.3M    3.1M  7.2M     1M       1M
            75        3M   5.1M    4.7M 10.9M    1.6M     1.6M
           100        4M   6.8M    6.3M 14.6M    2.1M     2.1M
           125        5M   8.5M    7.9M  18M     2.7M     2.7M
           150        6M  10.1M    9.5M 21.8M    3.2M     3.2M
           200        8M  13.5M   12.6M  29M     4.3M     4.3M
Fig. 5. Number of concepts, roles, and individuals in original and completed ABoxes.


5.2   Results

We report on three experiments, in each experiment varying a different param-
eter: in experiment one, we vary the size of the ABox via the number of univer-
sities generated by the data generator; in experiment two, we vary the degree
of incompleteness of the data; and in experiment three, we vary the number of
subclasses, i.e., the parameter n of the ontology LUBM∃n . All experiments were
carried out on a Linux (3.2.0) machine with a 3.5Ghz quad-core processor and
8GB of RAM, using IBM DB2 version 9.7.5.
    The size of the test data is detailed in Figure 5 and query execution times
for the first experiment are reported in Figure 7. Here we use 5% incompleteness
of the data and n = 20 subclasses in LUBM∃n . The orange curves are for the
combined approach with filtering while the blue ones indicate the pure rewriting
approach for those cases in which Requiem succeeded generating a rewriting.
Using the combined approach with filtering, all queries were answered within
very reasonable time. To better understand the results, it is interesting to con-
sider the number of spurious and valid answers for each query shown in Figure 6
for the 200 universities experiment. Query q2 , designed specifically to stress-test
the filter, as expected produces a huge number of spurious answers. Indeed, the




                                        28
14        The Combined Approach to OBDA: Taming Role Hierarchies using Filters

                          q1 q2 q3 q4 q5 q6 req1 req2 req3 req4 req5
         spurious answers 2 28M 2 24K 0 0 0 22K 0 163K 0
           valid answers 4.6M 0 0 0 8.3M 0 0 410K 48K 137K 0
    Fig. 6. Number of answers for 200 universities, 5% incompleteness, 20 subclasses.


comparably long execution time of this query appears to be mainly due to the
fact that DB2 has to handle a large number of spurious answers before the filter-
ing takes place, and not to a poor performance of the filter itself. The execution
times of q1 and q5 can also be explained by a large number of answers. Note
that the number of filter calls is actually the sum of the numbers of answers,
both spurious and valid. Also note that, in principle, it is possible to avoid an
extremely large number of spurious answers in q2 (and any other query) at the
cost of slightly increasing the size of the canonical model: duplicate the anony-
mous part of the canonical model so that no two individuals in the original
ABox ‘share’ an anonymous part of the canonical model. Analyzing this further
in experiments is left for future work.
    Experiments two and three are reported about in Figures 8 and 9. Here,
we only tested the filtering approach. In both cases, we use 100 universities.
In experiment two, the number of subclasses is fixed to 20 while in experiment
three, the degree of incompleteness is fixed to 5%. In general, the degree of
incompleteness has virtually no effect on the execution time of queries. Again,
q2 is an exception as the number of spurious answers increases dramatically
from 3M for 1% incompleteness to 125M for 20% incompleteness. The number
of subclasses also has essentially no effect on query execution times (in contrast to
the pure query rewriting approach for which a non-trivial number subclasses can
dramatically increase the size of the rewritten query). Note that the execution
time of q2 becomes shorter with an increasing number of subclasses because the
number of spurious matches decreases from 6.5M for 5 subclasses to 211K for
100 subclasses: this is due to the atoms Subj3Student and Subj4Student in q2 and
the fact that the number of assertions for these two concepts decreases as the
number of subject areas increases.


6     Conclusion
We have modified the combined approach to OBDA by replacing the query
rewriting part with a filtering technique. This step is natural from an implemen-
tation perspective and allows to circumvent an exponential blowup of the query.
Based on experiments with an improved version of the LUBM ontology, we have
demonstrated the scalability of our approach.
     As future work, we plan to extend the combined approach with filtering to
other description logics for which, until now, it is unknown how to avoid an
exponential blowup. For example, we believe that polytime filtering is possible
for the extension of EL with transitive roles, as found in the OWL2 EL pro-
file. It would also be interesting to better understand the impact of modifying
the canonical model on query answering, both from a theoretical and from a




                                           29
                         Carsten Lutz, İnanç Seylan, David Toman, and Frank Wolter                                         15




           100
time (s)




            50                                                                               200
                                                                                          150         s
                                                                                                   IV
                                                                                       100       N
             0                                                                                 U
                  q1     q2    q3    q4    q5    q6 req1 req2 req req req           50      of
                                                                 3   4    5               #

                  Fig. 7. Query run times for varying numbers of Universities.




           400
time (s)




           200                                                                                          20
                                                                                               15                        )
                                                                                                                    (%
                                                                                        10                      e
             0                                                                                          p   let
                  q1     q2    q3    q4    q5                                       5               m
                                                 q6 req1 req2 req req req
                                                                 3   4    5    02        i   nc
                                                                                                o


                      Fig. 8. Query run times for varying incompleteness (in %).
time (s)




           20
                                                                                                    100
                                                                                             80                         es
                                                                                                                   ss
            0
                                                                                        60
                                                                                                         b   cla
                 q1     q2    q3    q4    q5    q6 req1 req2 req req req           40               su
                                                                3   4    5    20              of
                                                                                         #

                       Fig. 9. Query run times for varying number of subclasses.




                                                        30
16      The Combined Approach to OBDA: Taming Role Hierarchies using Filters

practical perspective. For example, we do not know whether the more natural
canonical model obtained by identifying all individuals cR,0 and cR,1 admits
polytime filtering. Moreover, as discussed above it is conceivable that versions
of the canonical model that are less economic regarding individual reuse result
in better runtime in practice.
    We also plan to compare the performance of our approach more thoroughly
with the performance of pure query rewriting, using other state-of-the-art query
rewriting tools such as Quest [12], Presto [13], OWLgres [14], CLIPPER [4]. In
this context, it is interesting to note that promising new optimization techniques
have recently been developed in [11] and implemented in the Quest system.
While some of them (such as the exploitation of ABox integrity constraints) aim
specifically at the query rewriting approach, others (such as semantic indexing)
can easily be combined with the filtering approach proposed in this paper.


References
 1. Baader, F., Calvanese, D., McGuinness, D., Nardi, D., Patel-Schneider, P.F. (eds.):
    The Description Logic Handbook: Theory, Implementation and Applications. Cam-
    bridge University Press (2003)
 2. Calı̀, A., Gottlob, G., Pieris, A.: New expressive languages for ontological query
    answering. In: AAAI (2011)
 3. Calvanese, D., De Giacomo, G., Lembo, D., Lenzerini, M., Rosati, R.: Tractable
    reasoning and efficient query answering in description logics: The DL-Lite family.
    J. of Automated Reasoning 39(3), 385–429 (2007)
 4. Eiter, T., Ortiz, M., Simkus, M., Tran, T.K., Xiao, G.: Query rewriting for Horn-
    SHIQ plus rules. In: AAAI (2012)
 5. Eiter, T., Ortiz, M., Simkus, M., Tran, T.K., Xiao, G.: Towards practical query
    answering for Horn-SHIQ. In: Description Logics (2012)
 6. Guo, Y., Pan, Z., Heflin, J.: LUBM: A benchmark for OWL knowledge base sys-
    tems. J. Web Sem. 3(2-3), 158–182 (2005)
 7. Kikot, S., Kontchakov, R., Podolskii, V.V., Zakharyaschev, M.: Exponential lower
    bounds and separation for query rewriting. In: ICALP (2). pp. 263–274 (2012)
 8. Kontchakov, R., Lutz, C., Toman, D., Wolter, F., Zakharyaschev, M.: The com-
    bined approach to query answering in DL-Lite. In: KR (2010)
 9. Lutz, C., Wolter, F., Toman, D.: Conjunctive query answering in the description
    logic EL using a relational database systems. In: IJCAI. pp. 2070–2075 (2009)
10. Pérez-Urbina, H., Horrocks, I., Motik, B.: Efficient query answering for OWL 2.
    In: International Semantic Web Conference. pp. 489–504 (2009)
11. Rodriguez-Muro, M., Calvanese, D.: High performance query answering over DL-
    Lite ontologies. In: KR (2012)
12. Rodriguez-Muro, M., Calvanese, D.: Quest, an OWL 2 QL reasoner for ontology-
    based data access. In: OWLED (2012)
13. Rosati, R., Almatelli, A.: Improving query answering over DL-Lite ontologies. In:
    KR (2010)
14. Stocker, M., Smith, M.: Owlgres: A scalable OWL reasoner. In: OWLED (2008)
15. Thomazo, M., Baget, J.F., Mugnier, M.L., Rudolph, S.: A generic querying algo-
    rithm for greedy sets of existential rules. In: KR (2012)




                                         31