=Paper= {{Paper |id=None |storemode=property |title=Dependencies: Making Ontology Based Data Access Work |pdfUrl=https://ceur-ws.org/Vol-749/paper21.pdf |volume=Vol-749 |dblpUrl=https://dblp.org/rec/conf/amw/Rodriguez-MuroC11 }} ==Dependencies: Making Ontology Based Data Access Work== https://ceur-ws.org/Vol-749/paper21.pdf
    Dependencies: Making Ontology Based Data Access
                   Work in Practice

                     Mariano Rodrı́guez-Muro and Diego Calvanese

                 KRDB Research Centre, Free University of Bozen-Bolzano
                         Piazza Domenicani 3, Bolzano, Italy
                    {rodriguez,calvanese}@inf.unibz.it



       Abstract. Query answering in Ontology Based Data Access (OBDA) exploits
       the knowledge of an ontology’s TBox to deal with incompleteness of the ABox
       (or data source). Current query-answering techniques with DL-Lite require expo-
       nential size query reformulations, or expensive data pre-processing, and hence
       may not be suitable for data intensive scenarios. Also, these techniques present
       severe redundancy issues when dealing with ABoxes that are already (partially)
       complete. It has been shown that addressing redundancy is not only required for
       tractable implementations of decision procedures, but may also allow for sizable
       improvements in execution times. Considering the previous observations, in this
       paper we present two complementary sets of results that aim at improving query
       answering performance in OBDA systems. First, we show that we can characterize
       completeness of an ABox by means of dependencies, and that we can use these to
       optimize DL-Lite TBoxes. Second, we show that in OBDA systems we can create
       ABox repositories that appear to be complete w.r.t. a significant portion of any DL-
       Lite TBox. The combination of these results allows us to design OBDA systems
       in which redundancy is minimal, the exponential aspect of query answering is
       notably reduced and that can be implemented efficiently using existing RDBMSs.


1   Introduction

The current approaches to Ontology Based Data Access (OBDA) with lightweight
Description Logics (DLs) of the DL-Lite family [1] rely on query reformulation. These
techniques are based on the idea of using the ontology to rewrite a given query into a
new query that, when evaluated over the data sources, returns the certain answers to the
original query. Experiments with unions of conjunctive queries (UCQs) have shown that
reformulations may be very large, and that the execution of these reformulations suffers
from poor performance. This triggered the development of alternative reformulation
techniques [6, 9], in which the focus has been on the reduction of the number of generated
queries/rules. These techniques have shown some success, however query reformulation
in all of them is still worst-case exponential in the size of the original query. Alternative
approaches [4] use the expansion of the extensional layer of the ontology (i.e., the ABox)
w.r.t. the intensional knowledge (i.e., the TBox) to avoid query reformulation almost
entirely. However, the cost of data expansion imposes severe limitations on the system.
We believe that approaching the problem of the high cost of query answering in OBDA
systems requires a change of focus: namely, from the ’number of queries’ perspective, to
the perspective that takes into account the ’duplication in the answers’ appearing in query
results under SQL multiset semantics. Duplication in results is a sign of redundancy in
the reasoning process; it not only generated not only by the reformulation procedures
as traditionally thought, since also techniques based on ABox expansion show this
problem. Instead, redundancy is the consequence of ignoring the semantics of the data
sources. In particular, when the data in a source (that is used to populate the ABox of the
ontology) already satisfies an inclusion assertion of the TBox (i.e., is complete w.r.t. such
an inclusion assertion), then using that inclusion assertion during query answering might
generate redundant answers [8]. It has been noted [3] that it is not hard to find examples
in which the runtime of decision procedures can change from exponential to polynomial
if redundancy is addressed, and this is also the case in OBDA query answering. In this
paper we address both problems, redundancy and the exponential blow-up of query
reformulations, by following two complementary directions.
     First, we present an approach to take into account completeness of the data with
respect to the TBox. We characterize completeness using ABox dependencies and show
that it is possible to use dependencies to optimize the TBox in order to avoid redundant
computations independently of the reasoning technique. Second, we focus on how
we can optimally complete ABoxes in OBDA systems by relying on the fact that in
OBDA systems it is possible to manipulate not only the data, but also the mappings and
database schema. This allows us to conceive procedures to store an ABox in a source
in such a way that it appears to be complete with respect to a significant portion of the
TBox, but without actually expanding the data. We present two such procedures, one
for general and one for ’virtual’ OBDA systems, both designed to take advantage of the
features of modern RDBMSs effectively. Our results allow for the design of systems
that can delegate reasoning tasks (e.g., dealing with hierarchies, existentially quantified
individuals, etc.) to stages of the reasoning process where these tasks can be handled
most effectively. The result is a (sometimes dramatic) reduction of the exponential
runtime and an increase in the quality of the answers due to the reduction of duplication.
     The rest of the paper is organized as follows: Section 2 gives technical preliminaries.
Section 3 presents our first main contribution, a general technique for optimizing TBoxes
w.r.t. dependencies. Section 4 introduces data dependencies in OBDA systems, describing
why it is natural to expect completeness of ABoxes. Section 5 presents two techniques
for completing ABoxes in OBDA systems, and Section 6 concludes the paper.


2   Preliminaries

In the rest of the paper, we assume a fixed vocabulary V of atomic concepts, denoted A
(possibly with subscripts), and atomic roles, denoted P , representing unary and binary
relations, respectively, and an alphabet Γ of (object) constants.
Databases. In the following, we regard a database (DB) as a pair D = hR, Ii, where R
is a relational schema and I is an instance of R. The active domain ΓD of D is the set of
constants appearing in I, which we call value constants. An SQL query ϕ over a DB
schema R is a mapping from a DB instance I of R to a set of tuples.
DL-Lite ontologies. We introduce the DL DL-LiteF , on which we base our results.
In DL-LiteF , a basic role, denoted R, is an expression of the form P or P − , and a


                                             2
basic concept, denoted B, is an expression of the form A or ∃R. An ontology is a pair
O = hT , Ai where T is a TBox and A an ABox. A TBox is a finite set of assertions
of the form B1 v B2 , called (positive) inclusions, B1 v ¬B2 , called disjointness
assertions, or (funct R), called functionality assertions. An ABox is a finite set of
membership assertions of the form A(c) or P (c, c0 ), where c, c0 ∈ Γ .
Queries over ontologies. An atom is an expression of the form A(t) or P (t, t0 ), where
t and t0 are atom terms, i.e., variables or constants in Γ . An atom is ground if it contains
no variables. A conjunctive query (CQ) q over an ontology O is an expression of the
form q(x) ← β(x, y), where x is a tuple of distinct variables, called distinguished, y is
a tuple of distinct variables not occurring in x, called non-distinguished, and β(x, y) is
a conjunction of atoms with variables in x and y, whose predicates are atomic concepts
and roles of O. We call q(x) the head of the query and β(x, y) its body. A union of
CQs (UCQ) is a set of CQs (called disjuncts) with the same head. Given a CQ Q with
body β(z) and a tuple v of constants of the same arity as z, we call a ground instance
of Q the set β[z/v] of ground atoms obtained by replacing in β(z) each variable with
the corresponding constant from v.
Semantics. An interpretation I = (∆I , ·I ) consists of a non-empty interpretation
domain ∆I and an interpretation function ·I that assigns to each constant c an element
cI of ∆I , to each atomic concept A a subset AI of ∆I , and to each atomic role P a
binary relation over ∆I . Moreover, basic roles and basic concepts are interpreted as
follows: (P − )I = {(o2 , o1 ) | (o1 , o2 ) ∈ P I } and (∃R)I = {o | ∃o0 . (o, o0 ) ∈ RI }. An
interpretation I is a model of B1 v B2 if B1I ⊆ B2I , of B1 v ¬B2 if B1I ∩ B2I = ∅,
and of (funct R) if for each o, o1 , o2 ∈ ∆I we have that (o, o1 ) ∈ RI and (o, o2 ) ∈ RI
implies o1 = o2 . Also, I is a model of A(c) if cI ∈ AI , and of P (c, c0 ) if (cI , c0I ) ∈ P I .
In DL-LiteF , we adopt the Unique Name Assumption (UNA), which enforces that for
each pair of constants o1 , o2 , if o1 6= o2 , then oI1 6= oI2 . For a DL-LiteF assertion α (resp.,
a set Θ of DL-LiteF assertions), I |= α (resp., I |= Θ) denotes that I is a model of α
(resp., Θ). A model of an ontology O = hT , Ai is an interpretation I such that I |= T
and I |= A. An ontology is satisfiable if it admits a model. An ontology O entails an
assertion α, denoted O |= α, if every model of O is also a model of α. Similarly, for a
TBox T and an ABox A instead of O. The saturation of a TBox T , denoted sat(T ), is
the set of DL-LiteF assertions α s.t. T |= α. Notice that sat(T ) is finite, hence a TBox.
     Let ΓA denote the set of constants appearing in an ABox A. The answer to a CQ
Q = q(x) ← β(x, y) over O = hT , Ai in an interpretation I, denoted ans(Q, O, I),
is the set of tuples c ∈ ΓA × · · · × ΓA such that there exists a tuple c0 ∈ ΓA × · · · × ΓA
such that the ground atoms in β[(x, y)/(c, c0 )] are true in I. The answer to an UCQ
Q in I is the union of the answers to each CQ in Q. The certain answers to Q in O,
denoted cert(Q, O), is the intersection of every ans(Q, O, I) for all models I for O.
The answer to Q over an ABox A, denoted eval (Q, A), is the answers to Q over A
viewed as a DB instance. A perfect reformulation of Q w.r.t. a TBox T is a query Q0 such
that for every ABox A such that hT , Ai is satisfiable, cert(Q, hT , Ai) = eval (Q0 , A).
Mappings. We adopt the definitions for ontologies with mappings from [7]. First we
extend interpretations to be able to create object constants from the value constants in a
DB D. Given an alphabet Λ of function symbols we define the set τ (Λ, ΓD ) of object
terms as the set of all terms of the form f (d1 , . . . , dn ), where f ∈ Λ, the arity of f is


                                                3
n, and d1 , . . . , dn ∈ ΓD . We set Γ = ΓD ∪ τ (Λ, ΓD ), and we extend the interpretation
function so that for each c ∈ τ (Λ, ΓD ) we have that cI ∈ ∆I . We extend queries by
allowing the use of predicate arguments that are variable terms, i.e., expressions of the
form f (t), where f ∈ Λ with arity n and t is an n-tuple of variables or value constants.
Given a TBox T and a DB D, a mapping (assertion) m for T is an expression of the
form ϕ(x) ; ψ(t) where ϕ(x) is an SQL query over D with answer variables x, and
ψ(t) is a CQ over T without non-distinguished variables using variable terms over
variables in x. We call the mapping simple if the body of ψ(t) consists of a single atom,
and complex otherwise. A simple mapping is for an atomic concept A (resp., atomic role
P ) if the atom in the body of ψ(t) has A (resp., P ) as predicate symbol. In the following,
we might abbreviate the query ψ in a mapping by showing only its body. A virtual ABox
V is a tuple hD, Mi, where D is a DB and M a set of mappings, and an ontology with
mappings is a tuple OM = hT , Vi, where T is a TBox and V = hD, Mi is a virtual
ABox in which M is a set of mappings for T .
     An interpretation I satisfies a mapping assertion ϕ(x) ; ψ(t) w.r.t. a DB D =
hR, Ii if for every tuple v ∈ ϕ(I) and for every ground atom X in ψ[x/v] we have that:
if X has the form A(f (c)), then (f (c))I ∈ AI , and if X has the form P (f1 (c1 ), f2 (c2 )),
then ((f1 (c1 ))I , f2 (c2 )I ) ∈ P I . An interpretation I is a model of V = hD, Mi,
denoted I |= V, if it satisfies every mapping in M w.r.t. D. A virtual ABox V entails an
ABox assertion α, denoted V |= α, if every model of V is a model of α. I is a model of
OM = hT , Vi if I |= T and I |= V. As usual, OM is satisfiable if it admits a model.
We note that, in an ontology with mappings OM = hT , hD, Vii, we can always replace
M by a set of simple mappings, while preserving the semantics of OM. It suffices to
split each complex mapping ϕ ; ψ into a set of simple mappings that share the same
SQL query ϕ (see [7]). In the following, we assume to deal only with simple mappings.
Dependencies. ABox dependencies are assertions that restrict the syntactic form of
allowed ABoxes. In this paper, we focus on unary inclusion dependencies only. A
(unary) inclusion dependency is an assertion of the form B1 vA B2 , where B1 and B2
are basic concepts. In the following, for a basic role R and constants c, c0 , R(c, c0 ) stands
for P (c, c0 ) if R = P and for P (c0 , c) if R = P − . An ABox A satisfies an inclusion
dependency σ, denoted A |= σ, if the following holds: (i) if σ is A1 vA A2 , then for
all A1 (c) ∈ A we have A2 (c) ∈ A; (ii) if σ is ∃R vA A, then for all R(c, c0 ) ∈ A
we have A(c) ∈ A; (iii) if σ is A vA ∃R, then for all A(c) ∈ A there exists c0 such
that R(c, c0 ) ∈ A. (iv) if σ is ∃R1 vA ∃R2 , then for all R1 (c, c0 ) ∈ A there exists
c00 such that R2 (c, c00 ) ∈ A. An ABox A satisfies a set of dependencies Σ, denoted
A |= Σ, if A |= σ for each σ ∈ Σ. A set of dependencies Σ entails a dependency
σ, denoted Σ |= σ, if for every ABox A s.t. A |= Σ we also have that A |= σ. The
saturation of a set Σ of dependencies, denoted sat(Σ), is the set of dependencies σ s.t.
Σ |= σ. Given two queries Q1 , Q2 , we say that Q1 is contained in Q2 relative to Σ if
eval (Q1 , A) ⊆ eval (Q2 , A) for each ABox A s.t. A |= Σ.


3    Optimizing TBoxes w.r.t. Dependencies

In a DL-LiteF ontology O = hT , Ai, the ABox A may be incomplete w.r.t. the TBox T ,
i.e., there may be assertions B1 v B2 in T s.t. A 6|= B1 vA B2 . When computing the


                                              4
certain answers to queries over O, the TBox T is used to overcome such incompleteness.
However, an ABox may already be (partially) complete w.r.t. T , e.g., an ABox A
satisfying A1 vA A2 is complete w.r.t. A1 v A2 . While ignoring completeness of an
ABox is ’harmless’ in the theoretical analysis of reasoning over DL-LiteF ontologies,
in practice, it may introduce redundancy, which manifests itself as containment w.r.t.
dependencies among the disjuncts (CQs) of the perfect reformulation, making the
contained disjuncts redundant. For example, let T and A be as before, and let Q be
q(x) ← A2 (x), then any perfect reformulation of Q must include q1 = q(x) ← A1 (x)
and q2 = q(x) ← A2 (x) as disjuncts. However, since q1 is contained in q2 relative to
A1 vA A2 , we have that q1 will not contribute new tuples w.r.t. those contributed by q2 .
    It is possible to use information about completeness of an ABox, expressed as
a set of dependencies, to avoid redundancy in the reasoning process. One place to
do this is during query reformulation, using techniques based on conjunctive query
containment (CQC) with respect to dependencies to avoid the generation of redundant
queries. However, this approach is expensive, since CQC is an NP-complete problem
(even ignoring dependencies), and such optimizations would need to be performed
every time a query is reformulated. We show now how we can improve efficiency by
pre-processing the TBox before performing reformulation. In particular, given a TBox T
and a set Σ of dependencies, we show how to compute a TBox T 0 that is smaller than T
and such that for every query Q the certain answers are preserved if Q is executed over
an ABox that satisfies Σ. Specifically, our objective is to determine when an inclusion
assertion of T is redundant w.r.t. Σ, and to do so we use the following auxiliary notions.
Definition 1. Let T be a TBox, B, C basic concepts, and Σ a set of dependencies over
T . A T -chain from B to C in T (resp., a Σ-chain from B to C in Σ) is a sequence
of concept inclusion assertions (Bi v Bi0 )ni=0 in T (resp., a sequence of inclusion
dependencies (Bi vA Bi0 )ni=0 in Σ), for some n ≥ 0, such that: B0 = B, Bn0 = C, and
                              0                                               0
for 1 ≤ i ≤ n, we have that Bi−1  and Bi are basic concepts s.t., either (i) Bi−1 = Bi ,
         0
or (ii) Bi−1 = ∃R and Bi = ∃R− , for some basic role R.
Intuitively, when there is a T -chain from B to C, the existence of an instance of B in
a model implies the existence of an instance of C. For a Σ-chain, this holds for ABox
assertions. We use T -chains and Σ-chains to characterize redundancy as follows.
Definition 2. Let T be a TBox, B, C basic concepts, and Σ a set of dependencies. The
concept inclusion assertion B v C is directly redundant in T w.r.t. Σ if (i) Σ |= B vA
C and (ii) for every T -chain (Bi v Bi0 )ni=0 with Bn0 = B in T , there is a Σ-chain
(Bi vA Bi0 )ni=0 . Then, B v C is redundant in T w.r.t. Σ if (a) it is directly redundant,
or (b) there exists B 0 6= B s.t. (i) T |= B 0 v C, (ii) B 0 v C is not redundant in T w.r.t.
Σ, and (iii) B v B 0 is directly redundant in T w.r.t. Σ.
    Given a TBox T and a set of dependencies Σ, we apply our notion of redundancy
w.r.t. Σ to the assertions in the saturation of T to obtain a TBox T 0 that is equivalent to
T for certain answer computation.
Definition 3. Given a TBox T and a set of dependencies Σ over T , the optimized
version of T w.r.t. Σ, denoted optim(T , Σ), is the set of inclusion assertions {α ∈
sat(T ) | α is not redundant in sat(T ) w.r.t. sat(Σ)}.


                                             5
Correctness of using T 0 = optim(T , Σ) instead of T when computing the certain
answers to a query follows from the following theorem.

Theorem 1. Let T be a TBox and Σ a set of dependencies over T . Then for every
ABox A such that A |= Σ and every UCQ Q over T , we have that cert(Q, hT , Ai) =
cert(Q, hoptim(T , Σ), Ai).

Proof. First we note that during query answering, only the positive inclusions are
relevant, hence we ignore disjointness and functionality assertions. Since sat(T ) adds to
T only entailed assertions, cert(Q, hT , Ai) = cert(Q, hsat(T ), Ai), for every Q and
A, and we can assume w.l.o.g. that T = sat(T ). Moreover, cert(Q, hT , Ai) is equal to
the evaluation of Q over chase(T , A). (We refer to [1] for the definition of chase for a
DL-LiteF ontology.) Hence it suffices to show that for every B v C that is redundant
with respect to Σ, chase(T , A) = chase(T \ {B v C }, A). We show this by proving
that if B v C is redundant (hence, removed by optim(T , Σ)), then there is always a
chase(T , A) in which B v C is never applicable. Assume by contradiction that B v C
is applicable to some assertion B(c) during some step in chase(T , A). We distinguish
two cases that correspond to the cases of Definition 2.
     (a) Case where B v C is directly redundant and hence Σ |= B vA C. We
distinguish two subcases: (i) B(c) ∈ A. Since A |= B vA C, we have C(c) ∈ A, and
hence B v C is not applicable to B(c). Contradiction. (ii) B(c) ∈     / A. Then there is a
sequence of chase steps starting from some ABox assertion B 0 (c0 ) that generates B(c).
Such a sequence requires a T -chain (Bi v Bi0 )ni=0 with B0 = B 0 and Bn0 = B, such
that each Bi v Bi0 is applicable in chase(T , A). Then, by the second condition of direct
redundancy, there is a Σ-chain (Bi vA Bi0 )ni=0 . Since A |= B0 vA B00 we have that
B00 (c0 ) ∈ A and hence B0 v B00 is not applicable to B 0 (c0 ). Contradiction.
     (b) Case where B v C has been removed by Definition 2(b), and hence there exists
B 0 6= B such that T |= B v B 0 . First we note that any two oblivious chase sequences
for T and A produce results that are equivalent w.r.t. query answering. Then it is enough
to show that there exists some chase(T , A) in which B 0 v C is always applied before
B v C and in which B v C is never applicable. Again, we distinguish two subcases:
(i) B(c) ∈ A. Then, since B v B 0 is directly redundant, we have that Σ |= B vA B 0 .
Since A |= Σ, we have that B 0 (c) ∈ A, and given that B 0 v C is always applied before
B v C, C(c) is added to chase(T , A) before the application of B v C, hence B v C
is in fact not applicable. Contradiction. (ii) B(c) ∈
                                                    / A. Then, arguing as in Case (a).(ii),
using B v B 0 instead of B v C, we can derive a contradiction.                          t
                                                                                        u

Complexity and implementation. Due to space limitations, we cannot provide a full
description of how to compute optim(T , Σ). We just note that the checks that are
required by optim(T , Σ) can be reduced to computing reachability between two nodes
in a DAG that represents the reachability relation of the chains in T and Σ. This operation
can be done in linear time.
Consistency checking. Consistency checking may also suffer from redundancy when
the ABox is already (partially) complete w.r.t. T . In this case, we need to consider,
in addition to inclusion dependencies, also functional and disjointness dependencies.
Due to spaces limitations we cannot provide more details, and just note that using


                                            6
these dependencies it is possible to extend the definitions to generate TBoxes that avoid
redundant consistency checking operations.


4   Dependencies in OBDA Systems

The purpose of the current section is to complement our argument w.r.t. completeness of
ABoxes by discussing when and why we can expect completeness in OBDA systems.
We start by observing that in OBDA systems, ABoxes are constructed, in general, from
existing data that resides in some form of data repository. In order to create an ABox,
the system requires some form of mappings from the source to the ontology. These may
be explicit logical assertions as the ones used in this paper, or they may be implicitly
defined through application code. Therefore, the source queries used in these mappings
become crucial in determining the structure of the ABox. In particular, any dependencies
that hold over the results of these queries will be reflected in the OBDA system as ABox
dependencies.

Example 1. Let R be a DB schema with the relation schema employee with attributes
id, dept, and salary, that stores information about employees, their salaries, and the
department they work for. Let M be the following mappings:

    SELECT id,dept FROM employee ; Employee(emp(id)) ∧
                                   WORKS-FOR(emp(id), dept(dept))
    SELECT id,dept FROM employee ; M anager(emp(id))∧
    WHERE salary > 1000            MANAGES(emp(id), dept(dept))

where Employee and Manager are atomic concepts and WORKS-FOR and MANAGES
are atomic roles. Then for every instance I of R, the virtual ABox V = hhR, Ii, Mi
satisfies the following dependencies:

Manager vA Employee        Manager vA ∃MANAGES ∃WORKS-FOR vA Employee
                         ∃MANAGES vA Manager      Employee vA ∃WORKS-FOR

In particular, the dependency in Column 1 follows from the containment relation between
the two SQL queries used in the mappings, and the remaining dependencies follow from
the fact that we populate WORKS-FOR (resp., MANAGES) using the same SQL query
used to populate Employee (resp., Manager).

    Turning our attention to the semantics of the data sources, we note that any given
DB is based on some conceptual model. At the same time, if we associate the data of any
given DB to the concepts and roles of a TBox T , it follows that this data is semantically
related to these concepts and roles, and that the conceptual model of the DB has some
common aspects with the semantics of T . It is precisely these common aspects that
get manifested as dependencies between queries in the mappings and that give rise to
completeness in ABoxes. Therefore, the degree of completeness of an ABox in an OBDA
system is in direct relation with the closeness of the semantics of the conceptual model
of the DB and the semantics of the TBox, and with the degree in which the DB itself
complies to the conceptual model that was used to design it.


                                            7
Example 2. To illustrate the previous observations we extend Example 1. First we note
that the intended meaning of the data stored in R is as follows: (i) employees with a
salary higher than 1,000 are managers, (ii) managers manage the department in which
they are employed, and (iii) every employee works for a department. Then, any TBox
that shares some of this semantics will present redundancy. For example, if T is

 Manager v Employee         Manager v ∃MANAGES      Employee v ∃WORKS-FOR
                         ∃MANAGES− v Department ∃WORKS-FOR− v Department

then the first row of assertions is redundant w.r.t. Σ. Instead, the semantics of the
assertions of the second row is not captured by the mappings. In an OBDA system with
such components, we should reason only w.r.t. Department. This can be accomplished
by optimizing T w.r.t. Σ using the technique presented in Section 3.


5   Dependency Induction

We focus now on procedures to complete ABoxes with respect to TBoxes. The final
objective is to simplify reasoning by diverting certain aspects of the process (e.g., dealing
with concept hierarchies and domain and range assertions) from the query reformulation
stage to other stages of the query answering process where they can be handled more
efficiently. We call these procedures dependency induction procedures since their result
can be characterized by a set of dependencies that hold in the ABox(es) of the system.
Formally, given an OBDA system O = hT , Vi, where V = hhR, Ii, Mi, we call a
dependency induction procedure a procedure that uses O to compute a virtual ABox V 0
such that the number of assertions in T for which V 0 is complete is higher than those for
V. An example of a dependency induction procedure is ABox expansion, a procedure
in which the data in I is chased w.r.t. T . The critical point in dependency induction
procedures is the trade-off between the degree of completeness induced, the system’s
performance, and the cost of the procedure. The purpose of the current section is to
present two dependency induction mechanism that provide good trade-offs. Both of them
are designed for the case in which the data sources are RDBMSs. Given a TBox T and a
(virtual) ABox V, both techniques are able to generate a system where, if T |= B v A,
then V 0 |= B vA A. Hence, V 0 is complete for all DL-LiteF inferences except those
involving mandatory participation assertions, e.g., B v ∃R.
Semantic Index. The first proposal is applicable in the context of general OBDA systems
in which we are free to manipulate any aspect of the system to improve query answering.
The basic idea of the semantic index is to associate a numeric value to each concept of
the ontology in such a way that for any two concepts A1 , A2 , if T |= A1 v A2 , then the
value of A1 is less than that of A2 . ABox membership assertions are then represented in
the DB using these numeric values instead of concept names. This allows one to retrieve
most of the implied instances of a concept by posing simple range queries to the DB
(which are very efficient in modern RDBMSs). Our proposal is related to a technique for
XPath query evaluation known as Dynamic Intervals [2], however, while the latter deals
with XML trees, we have to deal with concept hierarchies that are DAGs. Formally, a
semantic index is defined as follows, where CT denotes the set of atomic concepts of T .


                                             8
Definition 4. Given a DL-LiteF TBox T , a semantic index for T is a pair of mappings
hidx , rangei with idx : CT → N and range : CT → 2N×N , such that for each pair
of atomic concepts A1 , A2 in CT , we have that T |= A1 v A2 iff there is a pair
h`, hi ∈ range(A2 ) such that ` ≤ idx (A1 ) ≤ h.

     Using a semantic index hidx , rangei for a TBox T , we construct V = hR, Ii with
the completeness properties described above, by proceeding as follows. We define a
DB schema R with a universal-like relation univc[c1, idx] for storing ABox concept
assertions, and one relation role P [c1, c2] for each atomic role P , such that c1 and c2
have type constant and idx has type numeric. Given an ABox A, we construct I such
that for each A(c) ∈ A we have hc, idx (A)i ∈ univc and for each P (c, c0 ) ∈ A we have
hc, c0 i ∈ roleP . The schema and the index allow us to define, for each atomic concept
A, a set of range queries over D that retrieves most constants c such that O |= A(c).
E.g., if range(A) = {h2, 35i}, we define ’SELECT c1 FROM univc WHERE idx
>= 2 AND idx <= 35’. We use these queries to define the mappings of the system
as follows1 : (i) for each atomic concept A and each h`, hi ∈ range(A), we add the
mapping σ`≤idx≤h (univc) ; A(c1 ); (ii) for each atomic role P , we add the mapping
role P ; P (c1 , c2 ); (iii) for each atomic role P and each atomic concept A such that
T |= ∃P v A (resp., ∃P − v A), we add the mapping πc1 (role P ) ; A(c1 ) (resp.,
πc2 (role P ) ; A(c2 )).
     Note that a semantic index can be trivially constructed by assigning to each concept
a unique (arbitrary) value and a set of ranges that covers all the values of subsumed
concepts. However, this is not effective for optimizing query answering since the size of
M determines exponentially the size of the final SQL query. To avoid an exponential
blow-up, we generate idx and range using the implied concept hierarchy as follows.
     Let T be a TBox, and D the minimal DAG that represents the implied is-a relation
between all atomic concepts of T (i.e., the transitive reduct of D)2 . Then we can
construct idx in the following way: (a) Define a counter i = 0; (b) Visit the nodes
in D in a depth-first fashion starting from the root nodes. At each step and given the
node N visited at that step, (1) if idx (N ) is undefined, set idx (N ) = i and i = i + 1,
(2) else if idx (N ) is defined, backtrack until the next node for which idx is undefined.
Now, to generate range we proceed as follows: (a) Visit the nodes in D starting from
the leafs and going up. For each node N in the visit, (1) if N is a leaf in D, then we
set range(N ) = {hidx (N ), idxS    (N )i} (2) if N is not a leaf, then we set range(N ) =
merge({hidx(N ), idx(N )i} ∪ Ni | Ni →N ∈D range(Ni )), where merge is a function
that, given a set r of ranges, returns the minimal set r0 of ranges that has the same
numeric coverage as r, e.g., merge({h5, 7i, h3, 5i, h9, 10i}) = {h3, 7i, h9, 10i}.
Performance of the Semantic Index. We now describe a preliminary evaluation of the
performance of the Semantic Index that was carried out using data from the Resource
Index (RI) [5]. The RI is an application that offers semantic search services over a collec-
tion of 22 well known collections of biomedical data (i.e., documents). The semantics of
the search is defined by the hierarchical information of ≈ 200 ontologies, including well
known bio-medical ontologies like the Gene Ontology, SNOMEDCT, etc. The basic func-
 1
     Here we use relational algebra expressions instead of SQL to simplify the exposition.
 2
     We assume w.l.o.g. that T does not contain a cyclic chain of basic concept inclusions.


                                                 9
tioning of the RI can be summarized as follows: (i) Using natural language processing,
each document is analyzed and annotated with one or more concepts from the ontolo-
gies. The annotations can be seen as ABox assertions, e.g., Cervical Cancer(0 doc2240 ).
(ii) Users pose queries of the form q(x) ← A1 (x) ∧ · · · ∧ An (x), where each Ai is a con-
cept from one of the ontologies in the collection. To compute the answers to the queries,
the RI executes the original query over the expanded ABox data. The information in
the RI’s ontologies amounts to ≈ 3 million concepts and ≈ 2.5 million is-a assertions.
Currently, the number of annotations to be expanded is large, e.g., the annotation process
generates ≈ 181 million annotations (≈ 14 GB of data) only for the documents in the
Clinical Trials.gov (CT) resource collection.
     Our experimentation focused on the cost of query answering by a) traditional query
reformulation, b) query answering using a Semantic Index, and c) query answering with
ABox expansion. To carry out the experimentation we used part of the data from the
RI, specifically, we used the full set of is-a assertions from the ontology collection and
all the document annotations for the CT resource. Also note that since there are no
atomic roles in the RI, the Semantic Index technique can generate a complete virtual
ABox (i.e., optim(T , Σ) = ∅). We stored the data in a DB2 9.7 DB hosted in a
Linux virtual machine with 4x2.67 Ghz Intel Xeon processors (only one core was used)
and 4 GB of RAM available to DB2. Using this data we created a Semantic Index
(time required: 27 s) and created a new table with the CT annotations extended with
the Semantic Index. We issued several queries; the one we describe here is q(x) ←
DNA Repair Gene(x) ∧ Antigen Gene(x) ∧ Cancer Gene(x) over the CT data. The
query returns a total of 2 distinct resources. The results of each technique are as follows3 :
a) when rewriting using traditional query reformulation (see [1, 6]), the result is either
one SQL query with 467874 disjuncts, or a union of 467874 SQL queries; non of these
queries is executable by the engine; b) when applying the Semantic Index technique, the
result of query reformulation is one single SQL query involving 3 range disjunctions; the
query requires 3.582s to execute (0.082s if the DB is warm, e.g., the indexes have been
preloaded); c) if the ABox is expanded, the TBox is also discarded and the reformulation
is equal to the original query; executing this query over the expanded data requires 3 s
(0 s if warm). With respect to the cost of the expansion, LePendu et al. indicate that a
straightforward expansion for the CT resource collection requires ≈ 7 days, generates an
additional ≈ 126 GB of data and, after a careful optimization of the process (including
data partitioning, parallelization, etc.) this time can be reduced to ≈ 40 minutes. Given
these results, we believe that the Semantic Index is a very promising option for semantic
query answering. A system that uses the Semantic Index can provide a better performance
than a system that relies only on query reformulation or ABox expansion.
T -Mappings for OBDA. The second technique we present is applicable in the virtual
OBDA context, where the data should stay in the data sourceand the only aspect of the
virtual ABox we can manipulate are the mappings. Our proposal is based on the idea that,
given an atomic concept A and a basic concept B, if we want to induce a dependency
B vA A, we have to ensure that the mappings for A retrieve all the data retrieved by
those for B. We call the technique T -mappings, and define it formally as follows.
 3
     More information about these tests, including the SQL queries can be found at https://
     babbage.inf.unibz.it/trac/obdapublic/wiki/resourceindex


                                             10
Definition 5. Given a TBox T and a virtual ABox V = hD, Mi, a T -mapping for T
w.r.t. V is a virtual ABox V 0 = hD, M0 i such that for every ABox assertion α such
that V |= α, we have that V 0 |= α, and for each ground ABox assertion α such that
hT , Vi |= α we have that V 0 |= α.

A T -mapping M0 can be constructed trivially by using existing mappings to generate
new mappings that take into account the implications of T as follows: Initialize M0 = ∅,
and for each atomic concept A, (i) for each mapping ϕ ; A0 (f (x)) ∈ M such that T |=
A0 v A, add ϕ ; A(f (x)) to M0 , and (ii) for each mapping ϕ ; P (f (x), g(y)) ∈ M
such that T |= ∃P v A (resp., T |= ∃P − v A), add ϕ ; A(f (x)) (resp., ϕ ;
A(g(y))}) to M0 . However, such T -mappings will not be efficient since the increased
size of M will trigger a worst-case exponential blow-up during SQL query generation.
      To avoid this situation we aim at creating succinct mappings by resorting to SQL
query analysis as follows. Let D = hR, Ii be a DB instance, ΣD a set of dependencies
satisfied by D, SQ = {ϕ1 , . . . , ϕn } a set of SQL queries defined over R, and x a
list of attribute names, then mergesql (SQ, ΣD , x) is the function that returns a new
SQL query ϕ that is equivalent under ΣD to the union of all ϕi ∈ SQ projected on
x. E.g., if SQ = { ’SELECT id FROM employee’, ’SELECT id FROM employee
JOIN information ON id’, ’SELECT id FROM external’} and ΣD is the in-
clusion dependency external[id ] ⊆ employee[id ], then mergesql (SQ, ΣD , [id]) =
’SELECT id FROM employee’. Using mergesql , we define two more auxiliary func-
tions, mergemc and mergemr , which allow us to merge sets of mappings. Let SC be
a set of mappings for atomic concepts, SR a set of mappings for atomic roles, and
A an atomic concept. Let F be the set of all unique (up to variable renaming) vari-
S terms f (x) that are used in a mapping in SC , then mergemc(SC , ΣD , A) =
able
   fi (xi )∈F {mergesql (SQfi , ΣD , xi ) ; A(fi (xi ))}, where SQfi is the set of SQL
queries ϕ of all the mappings ϕ ; A1 (fi (x)) ∈ SC where xi and x are unifiable.
Let F1 (resp., F2 ) be the set of all unique (up to variable renaming) variable terms
f (x) that are used as first (resp., second)   component in the head of a mapping in
SR , then mergemr (SR , ΣD , 1, A) =S fi ∈F1 {mergesql (SQfi , ΣD , xi ) ; A(fi (xi ))}
                                          S
(resp., mergemr (SR , ΣD , 2, A) = fi ∈F2 {mergesql (SQfi , ΣD , xi ) ; A(fi (xi ))}),
where SQfi is the set of SQL queries ϕ of all the mappings ϕ ; P (fi (x), g(y)) (resp.,
ϕ ; P (g(y), fi (x))) in SR where xi and x unify. Let M (A) (resp., M (P )) be the set
of all mappings for the atomic concept A (resp., for the atomic role P ) in M. Then,
given aS   TBox T and a set
                         S M of mappings, the T -mappings for M S   w.r.t. T is defined as
M0 = P ∈V M (P )∪ A∈V mergemc(mA , ΣD , A), with mA = T |=Ai vA M (Ai )∪
mergemr (mr , ΣD , 1, A) ∪ mergemr (mr − , ΣD , 2, A), mr = T |=∃Pi vA M (Pi ), and
                                                                 S

mr − = T |=∃P − vA M (Pi ). Intuitively, this is the original role mappings in M union
            S
                   i
the merging of all the relevant mappings for each A (w.r.t. TBox implication).

Example 3. Let D be the DB in Example 1, M the simple version of the mappings in
that example (i.e., 4 simple mappings), and T the TBox in Example 2. Then the T -
mappings for T is M0 consisting of all the mappings in M plus the additional mapping
m = ’SELECT id,dept FROM employee’ ; Department(dept(dept)) which we
now explain. Note that ∃WORKS-FOR− and ∃MANAGES− are the only subsumees of
Department w.r.t. T and hence, only their mappings (let them be m1 , m2 ) are considered


                                           11
for the mappings of Department by mergemc. Then given the containment between the
queries of m1 and m2 , the output of mergemc({m1 , m2 }, ∅, Department) is m.


6   Conclusions and Future Work
In this paper we focused on issues of redundancy and performance in OBDA systems.
Several directions can be taken starting from the ideas presented here. First, although
TBox pre-processing is the best place to first address redundancy, it is also necessary to
apply redundancy elimination during reasoning, e.g., during query rewriting. Second,
redundancy may also appear during consistency checking, i.e., when an ABox is sound
w.r.t. to the TBox; this situation can also be characterized by ABox dependencies and
should be addressed. With respect to the evaluation of our proposals, further exper-
imentation is still required. In particular, it is necessary to provide a comprehensive
benchmark of the techniques discussed in this paper in comparison to other proposals.
As future work we are now exploring extensions of the techniques to more expressive
languages, e.g., DL-LiteA which supports also role inclusions, and SPARQL queries over
RDFS ontologies. In this last context, we believe that our techniques can be used to pro-
vide high-performance SPARQL end points with sound and complete RDFS entailment
regime support, without relying on inference materialization, as is usually done.
Acknowledgements. We thank Dr. Paea LePendu for providing us with the RI data
and supporting us in understanding this application. This research has been partially
supported by the EU under the ICT Collaborative Project ACSI grant n. FP7-257593.


References
1. D. Calvanese, G. De Giacomo, D. Lembo, M. Lenzerini, and R. Rosati. Tractable reasoning
   and efficient query answering in description logics: The DL-Lite family. J. of Automated
   Reasoning, 39(3):385–429, 2007.
2. D. DeHaan, D. Toman, M. P. Consens, and M. T. Özsu. A comprehensive XQuery to SQL
   translation using dynamic interval encoding. In Proc. of ACM SIGMOD, pages 623–634, 2003.
3. G. Gottlob and C. Fermüller. Removing redundancy from a clause. Artif. Intell., 61(2), 1993.
4. R. Kontchakov, C. Lutz, D. Toman, F. Wolter, and M. Zakharyaschev. The combined approach
   to query answering in DL-Lite. In Proc. of KR 2010, 2010.
5. P. LePendu, N. Noy, C. Jonquet, P. Alexander, N. Shah, and M. Musen. Optimize first, buy
   later: Analyzing metrics to ramp-up very large knowledge bases. In Proc. of ISWC 2010,
   volume 6496 of LNCS, pages 486–501. Springer, 2010.
6. H. Pérez-Urbina, B. Motik, and I. Horrocks. Tractable query answering and rewriting under
   description logic constraints. J. of Applied Logic, 8(2):186–209, 2010.
7. A. Poggi, D. Lembo, D. Calvanese, G. De Giacomo, M. Lenzerini, and R. Rosati. Linking data
   to ontologies. J. on Data Semantics, X:133–173, 2008.
8. M. Rodrı́guez-Muro. Tools and Techniques for Ontology Based Data Access in Lightweight
   Description Logics. PhD thesis, KRDB Research Centre for Knowledge and Data, Free Univ.
   of Bozen-Bolzano, 2010.
9. R. Rosati and A. Almatelli. Improving query answering over DL-Lite ontologies. In Proc. of
   KR 2010, 2010.



                                              12