Query-based comparison of OBDA specifications

                         Meghyn Bienvenu1 and Riccardo Rosati2
                          1
                              Laboratoire de Recherche en Informatique
                               CNRS & Université Paris-Sud, France
              2
                  Dipartimento di Ingegneria informatica, automatica e gestionale
                               Sapienza Università di Roma, Italy


       Abstract. An ontology-based data access (OBDA) system is composed of one
       or more data sources, an ontology that provides a conceptual view of the data, and
       declarative mappings that relate the data and ontology schemas. In order to de-
       bug and optimize such systems, it is important to be able to analyze and compare
       OBDA specifications. Recent work in this direction compared specifications us-
       ing classical notions of equivalence and entailment, but an interesting alternative
       is to consider query-based notions, in which two specifications are deemed equiv-
       alent if they give the same answers to the considered query or class of queries for
       all possible data sources. In this paper, we define such query-based notions of en-
       tailment and equivalence of OBDA specifications and investigate the complexity
       of the resulting analysis tasks when the ontology is formulated in DL-LiteR .


1   Introduction
Ontology-based data access (OBDA) [13] is a recent paradigm that proposes the use
of an ontology as a conceptual, reconciled view of the information stored in a set of
existing data sources. The connection between the ontology and the data sources is
provided by declarative mappings, that relate the elements of the ontology with the
elements of the data sources. The ontology layer is the virtual interface used to access
data, through queries over the elements of the ontology.
    Due to the recent availability of techniques and systems for query processing in this
setting [5, 14], the OBDA approach has recently started to be experimented in real ap-
plications (see e.g. [1, 7, 10]). In these projects, the construction, debugging and main-
tenance of the OBDA specification, consisting of the ontology, the schemas of the data
sources, and the mapping, is a non-trivial task. Actually, the size and the complexity of
the ontology and, especially, the mappings makes the management of such specifica-
tions a practical issue in these projects. Providing formal tools for supporting the above
activities is therefore very important for the successful deployment of OBDA solutions.
    In addition, the OBDA specification plays a major role in query answering, since
the form of the specification may affect the system performance in answering queries:
different, yet semantically equivalent specifications may give rise to very different ex-
ecution times for the same query. So, the study of notions of equivalence and formal
comparison of OBDA specifications is also important for optimizing query process-
ing in OBDA systems. Indeed, some systems already implement forms of optimization
based on transformations of the OBDA specification (an example is [14]).
     So far, most of the work in OBDA has focused on query answering, often in a
simplified setting without any mappings. Very little attention has been devoted to the
formal analysis of OBDA specifications. The first approach that explicitly focuses on
the formal analysis of OBDA specifications is [12], whose aim is the identification
of semantic anomalies in mappings. Such an approach is based on a classical notion
of logical equivalence and entailment between OBDA specifications. While it is very
natural to resort to such classical notions, a significant alternative in many cases may
be the adoption of query-based notions of equivalence and comparison, in which two
specifications are compared with respect to a given query or a given class of queries,
and are deemed equivalent if they give the same answers to the considered queries for
all possible extensions of the data sources. This idea has been already explored in the
data exchange and schema mapping literature (see, e.g., [9]) and for description logics
for comparing TBoxes and knowledge bases [11, 4]. To the best of our knowledge, it
has never been explicitly considered for OBDA specifications.
     The majority of work on on OBDA has considered conjunctive queries (CQs) as the
query language. Therefore, a first natural choice would be to compare OBDA specifica-
tions with respect to the whole class of CQs. We thus define and study a notion of CQ-
entailment between OBDA specifications that formalizes this case. We also consider the
important subclass of instance queries (IQs), i.e., queries that ask for the instances of a
single concept or role, and analyze the notion of IQ-entailment between specifications.
Moreover, in many application contexts only a (small) set of predefined conjunctive
queries are of interest for the OBDA user(s): in such cases, it may be more appropriate
to tailor the comparison of specifications to a specific set of queries. For this reason, we
also study in this paper the notions of single CQ-entailment and single IQ-entailment,
which compare specifications with respect to a single CQ or IQ, respectively.
     We present a first investigation of the computational complexity of deciding the
above forms of entailment for a pair of OBDA specifications. We study ontologies spec-
ified in DL-LiteR and three different mapping languages (linear, GAV and GLAV). In
all cases, we provide exact complexity bounds for the entailment problem. Our results
are summarized in Figure 1. As shown in the table, the complexity of the entailment
check ranges from NL (non-deterministic logarithmic space) for linear mappings and
IQ-entailment to EXPTIME for CQ-entailment. To obtain these results, we show that
instead of considering all possible data instances, it is sufficient to consider a small num-
ber of databases of a particular form. We also exploit connections to query containment
in the presence of signature restrictions [3] and KB query inseparability [4].

2   Preliminaries
We start from four pairwise disjoint countably infinite set of names: the set of concept
names NC , the set of role names NR , the set of relation names Nrel , the set of constant
names NI (also called individuals).
    To introduce OBDA specifications, we first recall the notion of knowledge base
(KB) in Description Logics (DLs). A DL KB is a pair hT , Ai, where: T , called the
TBox, is the intensional component of the KB, and is constituted by a finite set of
axioms expressing intensional knowledge; and A, called the ABox, is a finite set of
atomic concept and role assertions (set of ground facts). We assume that the concept,
role and constant names occurring in every TBox and ABox belong to NC , NR and
NI , respectively. We denote by sig(T ) and sig(A) the set of concept and role names
occurring in T and A, respectively.
     Although the definitions of Section 3 are general, in Section 4 we will focus on
the DL DL-LiteR [6]. A DL-LiteR TBox consists of a finite set of concept inclusions
B v C and role inclusions R v S, where B, C, R, and S are defined according to the
following syntax (where A is a concept name and P is a role name):

         B → A | ∃R         C → B | ¬B          R → P | P−          S → R | ¬R

     We now introduce OBDA specifications. As already explained, a mapping asser-
tion specifies the semantic relationship between elements of a DL ontology, specified
through a TBox, to elements of a database. Such a relationship is specified through
a pair of queries, one over the TBox signature, and the other one over the database
signature. In this paper, we focus on the case where both queries involved in the map-
ping assertion are conjunctive queries: such mapping assertions are called GLAV (for
‘global-as-view’) mappings in the literature [8].
     Mappings are formally defined as follows. An atom is an expression r(t) where r
is a predicate and t is a tuple of variables and constants. Then, a (GLAV) mapping as-
sertion m is an expression of the form qs (x) → qo (x), where qs (x) (called the body
of m, body(m)) is a conjunction of atoms over predicates from Nrel and constants from
NI , qo (x) (called the head of m, head (m)) is a conjunction of atoms using predicates
from NC ∪ NR and constants from NI , and x, called the frontier variables of m, are the
variables that appear both in qo and in qs . The arity of m is the number of its frontier
variables. When qo (x) has the form p(x) (i.e., qo (x) is a single atom whose arguments
are x), we call m a GAV mapping assertion. A linear mapping assertion is a GAV asser-
tion whose body consists of a single atom. A (GLAV) mapping M is a set of mapping
assertions. A GAV mapping is a mapping constituted of GAV mapping assertions. A
linear mapping is a set of linear mapping assertions. Without loss of generality, we as-
sume that in every mapping M, every pair of distinct mapping assertions uses pairwise
disjoint sets of variables.
     An OBDA specification is a pair Γ = hT , Mi, where T is a TBox and M is a
mapping. Given a mapping assertion m of arity n and an n-tuple of constants a, we
denote by m(a) the assertion obtained from m by replacing the frontier variables with
the constants in a.
     Given a set of atoms AT , gr (AT ) is the function that returns the set of ground atoms
obtained from AT by replacing every variable symbol x with a fresh constant symbol
cx . We assume that such constant symbols do not occur elsewhere in the application
context of the function gr (i.e., in the TBoxes, mappings and databases involved).
     In this paper, a database (instance) is a set of ground atoms using relation names
from Nrel and constant names from NI . Given a mapping M and a database instance D,
we define the ABox for D and M, denoted as AM,D , as the following ABox:

          { β ∈ gr (head (m(a))) | m ∈ M and and D |= ∃y.body(m(a)) }

where we assume that y are the variables occurring in body(m(a)). Given an OBDA
specification Γ = hT , Mi and a database instance D, we define the models of Γ and
D, denoted as M ods(Γ, D) as the set of models of the KB hT , AM,D i. When such a set
is empty, we write hT , M, Di |= ⊥ (analogously, when a KB hT , Ai has no models,
we write hT , Ai |= ⊥).
    We are interested in the problem of answering instance queries and conjunctive
queries over a pair composed of an OBDA specification and a database. A Boolean
conjunctive query (CQ) is an expression of the form ∃x(α1 ∧ . . . ∧ αn ) where every αi
is an atom whose arguments are either constants or variables from x. For a non-Boolean
CQ q with answer variables v1 , . . . , vk , a tuple of constants a = ha1 , . . . , ak i occurring
in A is said to be a certain answer for q w.r.t. K just in the case that K |= q(a),
where q(a) is the Boolean query obtained from q by replacing each vi by ai . We call
instance query (IQ) a CQ consisting of a single atom of the form A(x) or R(x, y), with
A concept name, R role name, and x, y distinct free variables. We denote by sig(q) the
set of concept and role names occurring in a query q. We use CQ (resp. IQ ) to refer the
set of all CQs (resp. IQs) over the DL signature NC ∪ NR .
    Given an OBDA specification Γ = hT , Mi, a database instance D, and a conjunc-
tive query q, we define the certain answers for q w.r.t. (Γ, D) as the tuples of constants
from D that are certain answers for q w.r.t. hT , AM,D i. In particular, for Boolean CQs,
we say that q is entailed by (Γ, D), denoted by (Γ, D) |= q (or hT , M, Di |= q),
if I |= q for every I ∈ M ods(Γ, D). Note that for non-Boolean queries, we only
consider tuples of constants from D, in order to avoid including those fresh constants
introducing in AM,D by grounding existential variables in mapping heads.

3    Query-based Entailment for OBDA Specifications
We start by recalling the classical notion of entailment between OBDA specifications.
Definition 1 (Logical entailment). An OBDA specification hT1 , M1 i logically entails
hT2 , M2 i, written hT1 , M1 i |=log hT2 , M2 i if and only the first-order theory T1 ∪ M1
logically entails the first-order theory T2 ∪ M2 .
     We now define the formal notions of query-based entailment between OBDA spec-
ifications considered in this paper. First, we introduce a notion of entailment that com-
pares specifications based upon the constraints they impose regarding consistency.
Definition 2 (⊥-entailment). Let q be a query. An OBDA specification hT1 , M1 i ⊥-
entails hT2 , M2 i, written hT1 , M1 i |=⊥ hT2 , M2 i, iff, for every database D,
                       hT2 , M2 , Di |= ⊥      ⇒      hT1 , M1 , Di |= ⊥
    Next, we define a notion of query entailment between OBDA specifications with
respect to a single query.
Definition 3 (Single query entailment). Let q be a query. An OBDA specifica-
tion hT1 , M1 i q-entails hT2 , M2 i, written hT1 , M1 i |=q hT2 , M2 i, if and only if
hT1 , M1 i |=⊥ hT2 , M2 i and for every database D,
                   hT2 , M2 , Di |= q(a)       ⇒      hT1 , M1 , Di |= q(a)
When q is an IQ, we call the entailment relation in the preceding definition single IQ-
entailment, while we call it single CQ-entailment if q is an arbitrary CQ.
    We can generalize the previous definition to classes of queries as follows.
Definition 4 (Query entailment). Let L be a (possibly infinite) set of queries. An
OBDA specification hT1 , M1 i L-entails hT2 , M2 i, written hT1 , M1 i |=L hT2 , M2 i
iff hT1 , M1 i |=⊥ hT2 , M2 i and hT1 , M1 i |=q hT2 , M2 i for every query q ∈ L.
When L = IQ, we call the preceding entailment relation IQ-entailment, and for L =
CQ, we use the term CQ-entailment.
    Note that, for each of the above notions of entailment, a notion of equivalence be-
tween OBDA specifications can be immediately derived, corresponding to entailment
in both directions (we omit the formal definitions due to space limitations).
    The following property immediately follows from the above definitions.
Proposition 1. Let hT1 , M1 i, hT2 , M2 i be two OBDA specifications, and let L1 be a
set of queries. Then, hT1 , M1 i |=log hT2 , M2 i implies hT1 , M1 i |=L1 hT2 , M2 i. More-
over, if L2 ⊆ L1 , then hT1 , M1 i |=L1 hT2 , M2 i implies hT1 , M1 i |=L2 hT2 , M2 i.
    As a consequence of the above property, we have that logical entailment implies
CQ-entailment, and CQ-entailment implies IQ-entailment. The converse implications
do not hold, as the following examples demonstrate.
Example 1. We start by illustrating the difference between logical entailment
and CQ-entailment. Consider a database containing instances for the relation
EXAM(studentName,courseName,grade,date). Then, let Γ1 = hT1 , M1 i, where
                  T1 = {Student v Person, PhDStudent v Student}
                  M1 = {EXAM(x, y, z, w) → Student(x)}
and let Γ2 = hT2 , M2 i, where T2 = {Student v Person} and M2 = M1 . It is
immediate to verify that Γ2 6|=log Γ1 . However, we have that Γ2 |=CQ Γ1 . Indeed,
Γ2 |=CQ Γ1 can be intuitively explained by the fact that the mapping M1 does not
retrieve any instances of the concept PhDStudent (and there are no subclasses that can
indirectly populate it), so the presence of the inclusion PhDStudent v Student in T1
does not have any effect on query answering; in particular, every CQ that mentions
the concept PhDStudent cannot be entailed both under Γ1 and under Γ2 . Notice also
that, if we modify the mapping M1 to map PhDStudent instead of Student (i.e., if M1
were {EXAM(x, y, z, w) → PhDStudent(x)}), then CQ-entailment between Γ2 and Γ1
would no longer hold.
    Next, consider Γ3 = hT3 , M3 i, where T3 = ∅ and
     M3 = {EXAM(x, y, z, w) → Student(x), EXAM(x, y, z, w) → Person(x)}
Again, it it immediate to see that Γ3 6|=log Γ2 , while we have that Γ3 |=CQ Γ2 . Indeed,
Γ3 |=CQ Γ2 follows informally from the fact that the mapping M3 is able to “exten-
sionally” simulate the inclusion Student v Person of T2 , which is sufficient for Γ3 to
entail every CQ in the same way as Γ2 .
Example 2. We slightly modify the previous example to show the difference between
CQ-entailment and IQ-entailment. Consider Γ1 = hT1 , Mi and Γ2 = hT2 , Mi where
                  T1 = {Student v Person, Student v ∃takesCourse}
                  T2 = {Student v Person}
                  M = {EXAM(x, y, z, w) → Student(x)}
         Type of entailment    Type of mapping            Complexity
         logical               GAV / GLAV                 NP-complete
                               linear                     NL-complete
         ⊥                     GAV / GLAV                 NP-complete
                               linear                     NL-complete
         CQ                    linear / GAV / GLAV        EXPTIME-complete
         IQ                    linear                     NL-complete
                               GAV / GLAV                 NP-complete
         single CQ             linear / GAV / GLAV        Π2p -complete
         single IQ             linear                     NL-complete
                               GAV / GLAV                 NP-complete

     Fig. 1. Complexity results for entailment between OBDA specifications in DL-LiteR

Then, it can be easily verified that Γ2 6|=CQ Γ1 . Indeed, consider the Boolean CQ
∃x, y takesCourse(x, y): for every database D, this query is not entailed by the pair
(Γ2 , D), while this is not the case when the specification is Γ1 . On the other hand, we
have that Γ2 |=IQ Γ1 : in particular, for every database D and for every pair of individ-
uals a, b, neither (Γ1 , D) nor (Γ2 , D) entails the IQ takesCourse(a, b). Finally, let q be
the non-Boolean CQ ∃x takesCourse(x, y): then, it can be easily verified that the single
CQ-entailment Γ2 |=q Γ1 holds; while for the CQ q 0 of the form ∃y takesCourse(x, y),
the single CQ-entailment Γ2 |=q0 Γ1 does not hold.

4   Complexity Results for DL-LiteR
In this section, we investigate the computational properties of the different notions of
entailment between OBDA specifications defined in the previous section. For this first
study, we focus on the case in which the TBox is formulated in DL-LiteR [6], as it is
the basis for the OWL 2 QL profile and one of the most commonly considered DLs for
OBDA. The results of our complexity analysis are displayed in Figure 1.
    In what follows, we formally state the different complexity results and provide some
ideas about the proofs. We begin by considering the complexity of deciding classical
entailment between OBDA specifications.
Theorem 1. Classical logical entailment for OBDA specifications based upon
DL-LiteR TBoxes is NP-complete for GAV or GLAV mappings, and NL-complete for
linear mappings.

Proof. Let Γ1 = hT1 , M1 i, Γ2 = hT2 , M2 i. First, it is easy to see that Γ1 |=log Γ2 iff
(i) T1 |= T2 ; and (ii) Γ1 |=log M2 . Property (i) can be decided in NL [2]. Property (ii)
can be decided by an algorithm that, for every assertion m ∈ M2 , first builds a database
D corresponding to gr (body(m)) (i.e., obtained by “freezing” the body of m), and then
checks whether hΓ1 , Di entails the CQ corresponding to the head of m whose frontier
variables have been replaced by the corresponding constants. This algorithm runs in
NP in the case of GAV and GLAV mappings, and in NL in the case of linear mappings,
which implies the overall upper bounds in the theorem statement. The lower bound
for GAV mappings can be obtained through an easy reduction of conjunctive query
containment to logical entailment, while the one for linear mappings follows from a
reduction of the entailment of a concept inclusion axiom in a DL-LiteR TBox.     t
                                                                                 u

   We next consider ⊥-entailment. Our upper bounds rely on the following result that
shows it is sufficient to consider a small number of small databases.

Theorem 2. Let q be a CQ, and let Γ1 = hT1 , M1 i and Γ2 = hT2 , M2 i be OBDA
specifications such that T1 , T2 are formulated in DL-LiteR , and M1 , M2 are GLAV
mappings. Then Γ1 |=⊥ Γ2 if and only if hT1 , M1 , Di |= ⊥ for every database D
satisfying the following condition:1

 – Condition 1: D is obtained by (i) taking two mapping assertions m1 , m2 from
   M2 , (ii) selecting atoms α1 and α2 from head (m1 ) and head (m2 ) respectively,
   (iii) identifying in m1 and m2 some variables from α1 and α2 in such a way that
   hT , gr ({α1 , α2 })i |= ⊥, (iv) setting D equal to gr (body(m1 ) ∪ body(m2 )).

Proof. The one direction is immediate from the definitions. For the interesting direc-
tion, let us suppose that hT1 , M1 , Di |= ⊥ for every database D satisfying Condition
1. Let us further suppose that we have hT2 , M2 , D0 i |= ⊥, where D0 may be any
database. We thus have hT2 , AM2 ,D0 i |= ⊥. It is well known that every minimal in-
consistent subset of a DL-LiteR KB contains at most two ABox assertions, so there
must exist a subset A0 ⊆ AM2 ,D with |A0 | ≤ 2 such that hT2 , A0 i |= ⊥. Let γ be
the conjunction of atoms obtained by taking for each ABox assertion in A0 , a mapping
assertion that produced it, identifying those variables (and only those variables) needed
to produce the ABox assertion(s), and then taking the conjunction of the atoms in the
bodies. We observe that by construction Dγ = gr (γ) satisfies Condition 1 and is such
that hT1 , M1 , Dγ i |= ⊥. By construction, there is a homomorphism of γ into the origi-
nal database D0 . It follows that hT1 , M1 , D0 i |= ⊥.                                 t
                                                                                        u

      Using the preceding result, we can pinpoint the complexity of ⊥-entailment.

Theorem 3. The ⊥-entailment problem is NP-complete for OBDA specifications based
upon DL-LiteR TBoxes and GAV / GLAV mappings, and NL-complete in the case of
linear mappings.

Proof. We know from Theorem 2 that Γ1 |=⊥ Γ2 iff hT1 , M1 , Di |= ⊥ for every
database D satisfying Condition 1. For the GAV / GLAV case, we guess one such
database D and a polynomial-size proof that hT1 , M1 , Di |= ⊥. For the linear case, we
note that the databases satisfying Condition 1 contain at most 2 tuples each and can be
enumerated in logarithmic space. For every such database, we can check using an NL
oracle whether hT1 , M1 , Di |= ⊥. Since LNL = NL, we obtain an NL procedure.         t
                                                                                      u

   Next we consider entailment with respect to a specific query. We again start by
showing it is sufficient to consider a finite number of databases of a particular form.
 1
     Recall that distinct mapping assertions in a mapping have no common variables.
Theorem 4. Let q be a CQ, and let Γ1 = hT1 , M1 i and Γ2 = hT2 , M2 i be OBDA
specifications such that T1 , T2 are formulated in DL-LiteR , and M1 , M2 are GLAV
mappings. Then Γ1 |=q Γ2 if and only if Γ1 |=⊥ Γ2 and hT2 , M2 , Di |= q(a) implies
hT1 , M1 , Di |= q(a) for every database D satisfying the following condition:
 – Condition 2: D is obtained by (i) taking k ≤ |q| mapping assertions
   m1 , m2 , . . . , mk from M2 , (ii) identifying some of the frontier variables in
   m1 , m2 , . . . , mk , (iii) letting D = gr (body(m1 ) ∪ body(m2 ) ∪ . . . ∪ body(mk )).
If q is an IQ, then the latter condition can be replaced by:
 – Condition 3: D is obtained by (i) taking a mapping assertion m from M2 and
   choosing an atom α ∈ head (m), (ii) possibly identifying in m the (at most two)
   frontier variables appearing in α, and (iii) letting D = gr (body(m)).
Proof. Again the one direction is immediate. To show the non-trivial direction, let us
suppose that Γ1 |=⊥ Γ2 and that hT2 , M2 , Di |= q(c) implies hT1 , M1 , Di |= q(c)
for every tuple c and database D satisfying Condition 2 (we return later to the case
of IQs). Let us further suppose that we have hT2 , M2 , D0 i |= q(a). The first pos-
sibility is that hT2 , M2 , D0 i |= ⊥, in which case we have hT1 , M1 , D0 i |= ⊥ be-
cause of Γ1 |=⊥ Γ2 . We thus obtain hT1 , M1 , D0 i |= q0 (a). The other possibility
is that hT2 , M2 , D0 i |= q0 (a) and hT2 , M2 , D0 i 6|= ⊥. If hT1 , M1 , D0 i |= ⊥, we
immediately obtain hT1 , M1 , D0 i |= q0 (a). Otherwise, let AM2 ,D0 be the ABox for
M2 and D0 . Since hT2 , M2 , D0 i |= q0 (a), we have hT2 , AM2 ,D0 i |= q0 (a). It is
a well-known property of DL-LiteR that there exists a subset A0 ⊆ AM2 ,D0 with
|A0 | ≤ |q0 | such that hT2 , A0 i |= q0 (a). Let |A0 | = k, and let β1 , . . . , βk be the
ABox assertions in A0 . For each βi , we choose a mapping assertion mi ∈ M2 and
a homomorphism hi of body(mi ) into D0 such that gr (hi (head (m))) contains βi .
We also select an atom αi ∈ head (m) such that gr (h(α)) = β. Let m0i be ob-
tained from mi by identifying frontier variables y and z if hi (y) = hi (z), and set
D0 = gr (body(m01 ) ∪ . . . ∪ body(m0k )). It is easy to see that D0 satisfies Condition 2.
Moreover, by construction, the ABox AM2 ,D0 contains a subset A00 that is isomorphic
to A0 , and so hT2 , M2 , D0 i |= q0 (a0 ) where a0 is tuple corresponding to a according to
this isomorphism. Applying our assumption, we obtain hT1 , M1 , D0 i |= q0 (a0 ). Using
the fact that there is a homomorphism of body(m01 ) ∪ . . . ∪ body(m0k ) into D0 that is an
isomorphism on the frontier variables, we obtain hT1 , M1 , Di |= q0 (a).
     Finally, for the case of instance queries, we simply note that we have k = 1, and it
is only necessary to identify those variables in the head atom of the mapping that leads
to introducing the single ABox assertion of interest. This yields Condition 3.            t
                                                                                          u
   We pinpoint the complexity of single CQ-entailment, showing it to be Π2p -complete.
Theorem 5. The single CQ-entailment problem is Π2p -complete for OBDA specifica-
tions based upon DL-LiteR TBoxes and GLAV mappings. The lower bound holds even
for linear mapping assertions and when both TBoxes are empty.
Proof. For the upper bound, consider two OBDA specifications Γ1 = hT1 , M1 i and
Γ2 = hT2 , M2 i. From Theorems 2 and 4, we know that Γ1 6|=q Γ2 if and only if one of
the following holds:
 – there is a database D satisfying Condition 2 such that hT1 , M1 , Di 6|= ⊥;
 – there is a database D satisfying Condition 2 such that hT2 , M2 , Di |= q(a),
   hT2 , M2 , Di 6|= ⊥, and hT1 , M1 , Di 6|= q(a).
The first item can be checked using an NP oracle (by Theorem 3). To check the sec-
ond item, we remark that the size of databases satisfying Condition 2 cannot exceed
max(2, |q|) · maxbody, where maxbody is the maximum number of atoms appearing
in the body of a mapping assertion in M2 . It follows that to show that the second item
above is violated, we can guess a database D of size at most max(2, |q|) · maxbody
together with a tuple of constants a and a polynomial-size proof that hT2 , M2 , Di |=
q(a), and then we can verify using an NP oracle that hT1 , M1 , Di 6|= q(a). We there-
fore obtain a Σ2p procedure for deciding the complement of our problem.
    For the lower bound, we utilize a result from [3] on query containment
over signature-restricted ABoxes. In that paper, it is shown how, given a 2QBF
∀u∃vϕ(u, v), one can construct a TBox T , Boolean CQs q1 and q2 , and a signature
Σ such that ∀u∃vϕ(u, v) is valid iff T , A |= q1 ⇒ T , A |= q2 for all ABoxes A
with sig(A) ⊆ Σ. We will not detail the construction but simply remark that the same
TBox T = {T v V, F v V } is used for all QBFs, the signature Σ is given by
(sig(T ) ∪ sig(q1 ) ∪ sig(q2 )) \ {V }, and the query q2 is such that V 6∈ sig(q2 ).
    In what follows, we will show how given T , q1 , q2 , and Σ as above, we can reduce
the problem of testing whether T , A |= q1 implies T , A |= q2 for all Σ-ABoxes to
the problem of single CQ entailment. We will use Σ for our database instances, and we
create two copies Σ1 = {P 1 | P ∈ Σ} and Σ2 = {P 2 | P ∈ Σ} of the signature
Σ to be used in the head of mapping assertions. Next, we define sets of mapping as-
sertions copy1 (Σ) and copy2 (Σ) that simply copies all of the predicates in Σ into the
corresponding symbol in Σ1 (resp. Σ2 ). Formally, for j ∈ {1, 2},
copyj (Σ) = {A(x) → Aj (x) | A ∈ Σ ∩ NC } ∪ {R(x, y) → Rj (x, y) | R ∈ Σ ∩ NR }
We further define, given a data signature Λ1 and DL signature Λ2 , a set
populate(Λ1 , Λ2 ) of mapping assertions that populates the relations in Λ2 using all
possible combinations of the constants appearing in tuples over Λ1 :
  populate(Λ1 , Λ2 ) ={P1 (x1 , . . . , xk ) → P2 (x01 , . . . , x0` ) | P1 ∈ Λ1 , arity(P ) = k,
                                  P2 ∈ Λ2 , arity(P ) = `, {x01 , . . . , x0` } ⊆ {x1 , . . . , xk )}
Using copy1 (Σ), copy2 (Σ), populate(Σ, Σ 1 ), and populate(Σ, Σ 2 ), we construct the
following mappings:
       M1 = populate(Σ, Σ 1 ) ∪ copy2 (Σ)
       M2 = copy1 (Σ) ∪ populate(Σ, Σ 2 ) ∪ {T (x) → V (x), F (x) → V (x)}
Observe that both mappings are linear. For the query, we let q10 (resp. q20 ) be obtained
from q1 (resp. q2 ) by replacing every predicate P by P 1 ( resp. P 2 ). We also rename
variables so that q10 and q20 do not share any variables. We then let q be the CQ obtained
by taking the conjunction of q10 and q20 and existentially quantifying all variables. In the
appendix, we show that h∅, M1 i |=q h∅, M2 i iff T , A |= q1 ⇒ T , A |= q2 for all
Σ-ABoxes. By combining this with the reduction from [3], we obtain a reduction from
universal 2QBF to the q-entailment problem, establishing Π2p -hardness of the latter. t    u
   If we consider IQs instead, the complexity drops to either NP- or NL-complete.

Theorem 6. The single IQ-entailment problem is NP-complete for OBDA specifica-
tions based upon DL-LiteR TBoxes and either GAV or GLAV mappings. It is NL-
complete if linear mappings are considered.

Proof. We give the arguments for GAV and GLAV mappings (for linear case, see the
appendix). For the NP upper bound, consider two OBDA specifications Γ1 = hT1 , M1 i
and Γ2 = hT2 , M2 i, and let q be an IQ. By Theorem 4, Γ1 |=q Γ2 if and only if Γ1 |=⊥
Γ2 and hT2 , M2 , Di |= α implies hT1 , M1 , Di |= α for all databases D satisfying
Condition 3 and for all Boolean IQs α obtained by instantiating the variable(s) in q
with constant(s) from D.
    We already know that it is in NP to test whether Γ1 |=⊥ Γ2 . For the second property,
observe that there are only polynomially many databases satisfying Condition 2, since
each corresponds to choosing a mapping assertion m in M2 , an atom α ∈ head (m),
and deciding whether or not to identify variables in α. For every such database D, we
compute (in polynomial time) the set of Boolean IQs β obtained by instantiating the IQ
q with constants from D for which hT2 , M2 , gr (head (m))i |= β. For every such β, we
guess a polynomial-size proof that hT1 , M1 , Di |= β. If all of our polynomially many
guesses succeed, then the procedure returns yes, and otherwise no. By grouping all of
the guesses together, we obtain an NP decision procedure.
    The NP lower bound is by reduction from the NP-complete CQ containment prob-
lem: given two CQs q1 , q2 both having a single answer variable x, we have q1 ⊆
q2 iff h∅, {q2 → A(x)}i |=A(x) h∅, {q1 → A(x)}i, where A is a concept name
that does not appear in either of q1 and q2 .                                          t
                                                                                       u

   Finally, we consider entailment with respect to entire classes of queries. Again, we
can show it is sufficient to consider a small number of databases of a particular form.

Theorem 7. Let Γ1 = hT1 , M1 i and Γ2 = hT2 , M2 i be as in Theorem 4. For L ∈
{CQ, IQ}, Γ1 |=L Γ2 if and only if Γ1 |=⊥ Γ2 and hT2 , M2 , Di |= q(a) implies
hT1 , M1 , Di |= q(a) for every q ∈ L and every database D that satisfies Condition 3.

   We show that testing CQ-entailment is much more difficult than for single CQs.
Both the upper and lower bounds use recent results on KB query inseparability [4].

Theorem 8. CQ-entailment is EXPTIME-complete for OBDA specifications based
upon DL-LiteR TBoxes and either GLAV, GAV, or linear mappings.

Proof. We start with the proof of membership in EXPTIME. Consider OBDA spec-
ifications Γ1 = hT1 , M1 i and Γ2 = hT2 , M2 i. By Theorem 7, Γ1 |=CQ Γ2 if and
only if Γ1 |=⊥ Γ2 and hT2 , M2 , Di |= q(a) implies hT1 , M1 , Di |= q(a) for ev-
ery choice of q(a) and every database D satisfying Condition 3. We know that testing
Γ1 |=⊥ Γ2 can be done in NP (Theorem 3). To decide whether the second property
holds, we consider each of the (polynomially many) databases satisfying Condition 3.
For every such database D, we generate the two ABoxes AM1 ,D and AM2 ,D and the
corresponding KBs K1 = hT1 , AM1 ,D i and K2 = hT2 , AM2 ,D i. We then test whether
it is the case that for every CQ q over sig(K2 ), K2 |= q(a) implies K1 |= q(a), and we
return no if this is not the case. The preceding check corresponds to the Σ-query en-
tailment problem for DL-LiteR KBs, which has been recently studied in [4] and shown
to be EXPTIME-complete. We therefore obtain an EXPTIME procedure for deciding
CQ-entailment between OBDA specifications.
    Our lower bound also makes use of the recent work on query inseparability of
DL-LiteR knowledge bases. In [4], the following problem is shown to be EXPTIME-
complete: given DL-LiteR TBoxes T1 and T2 that are consistent with the ABox {A(c)},
decide whether the certain answers for q w.r.t. hT2 , {A(c)}i are contained in those for
hT1 , {A(c)}i for every CQ q with sig(q) ⊆ sig(T2 ). To reduce this problem to the CQ-
entailment problem for OBDA specifications, we consider the following linear map-
ping that populates a fresh concept A0 with all constants of a Σ-instance (refer to the
proof of Theorem 5 for the definition of populate): M1 = M2 = populate(Σ, {A0 }).
To complete the proof, we show in the appendix that hT1 , M1 i |=CQ hT2 , M2 i iff
hT2 , {A(c)}i |= q(a) implies hT1 , {A(c)}i |= q(a) for every CQ q, where T10 and T20
are obtained from T1 and T2 by replacing A with A0 .                                  t
                                                                                      u
    Our final result shows that IQ-entailment has the same complexity as single IQ-
entailment. The proof proceeds similarly to the proof of Theorem 6.
Theorem 9. IQ-entailment is NP-complete for OBDA specifications based upon
DL-LiteR TBoxes and either GAV or GLAV mappings. It is NL-complete if linear map-
pings are considered.

5     Conclusion and Future Work
In this paper, we have introduced notions of query-based entailment of OBDA spec-
ifications and have analyzed the complexity of checking query-based entailment for
different classes of queries and mappings and for TBoxes formulated in DL-LiteR .
     The present work constitutes only a first step towards a full analysis of query-based
forms of comparing OBDA specifications, and can be extended in several directions:
    – First, it would be interesting to extend the computational analysis of query entail-
      ment to other DLs beyond DL-LiteR . For instance, one interesting question for DLs
      with functional or cardinality restrictions concerns the impact of the Unique Name
      Assumption on the complexity of (and the techniques for) query entailment.
    – Second, other forms of mapping beyond GAV and GLAV could be analyzed. In par-
      ticular, we would like to see whether decidability of query entailment is preserved
      if we add some restricted form of inequality or negation to the mapping bodies.
    – Third, we could introduce a query signature and only test entailment for queries
      formulated in the given signature, as has been done for TBox and KB query insep-
      arability [4]. In fact, all of the complexity upper bounds in this paper hold also if
      we introduce a query signature, but this may not be the case for other DLs.
    – Finally, to explore the impact of restricting the set of possible databases, we could
      extend the computational analysis to database schemas with integrity constraints.
Acknowledgments. This research has been partially supported by the EU under FP7
project Optique (grant n. FP7-318338) and by the French National Research Agency
under ANR project PAGODA (grant n. ANR-12-JS02-007-01).
References
 1. N. Antonioli, F. Castanò, C. Civili, S. Coletta, S. Grossi, D. Lembo, M. Lenzerini, A. Poggi,
    D. F. Savo, and E. Virardi. Ontology-based data access: the experience at the Italian De-
    partment of Treasury. In Proc. of the Industrial Track of the 25th Int. Conf. on Advanced
    Information Systems Engineering (CAiSE), 2013.
 2. A. Artale, D. Calvanese, R. Kontchakov, and M. Zakharyaschev. The DL-Lite family and
    relations. J. of Artificial Intelligence Research, 36:1–69, 2009.
 3. M. Bienvenu, C. Lutz, and F. Wolter. Query containment in description logics reconsidered.
    In Proc. of the 13th Int. Conf. on the Principles of Knowledge Representation and Reasoning
    (KR), 2012.
 4. E. Botoeva, R. Kontchakov, V. Ryzhikov, F. Wolter, and M. Zakharyaschev. Query insepara-
    bility for description logic knowledge bases. In Proc. of the 14th Int. Conf. on the Principles
    of Knowledge Representation and Reasoning (KR), 2014.
 5. D. Calvanese, G. De Giacomo, D. Lembo, M. Lenzerini, A. Poggi, M. Rodriguez-Muro,
    R. Rosati, M. Ruzzi, and D. F. Savo. The Mastro system for ontology-based data access.
    Semantic Web J., 2(1):43–53, 2011.
 6. D. Calvanese, G. De Giacomo, D. Lembo, M. Lenzerini, and R. Rosati. Tractable reasoning
    and efficient query answering in description logics: The DL-Lite family. J. of Automated
    Reasoning, 39(3):385–429, 2007.
 7. D. Calvanese, M. Giese, P. Haase, I. Horrocks, T. Hubauer, Y. Ioannidis, E. Jiménez-Ruiz,
    E. Kharlamov, H. Kllapi, J. Klüwer, M. Koubarakis, S. Lamparter, R. Möller, C. Neuenstadt,
    T. Nordtveit, Ö. Özcep, M. Rodriguez-Muro, M. Roshchin, F. Savo, M. Schmidt, A. Soylu,
    A. Waaler, and D. Zheleznyakov. Optique: OBDA solution for big data. In Revised Selected
    Papers of ESWC 2013 Satellite Events, volume 7955 of Lecture Notes in Computer Science,
    pages 293–295, 2013.
 8. A. Doan, A. Y. Halevy, and Z. G. Ives. Principles of Data Integration. Morgan Kaufmann,
    2012.
 9. G. Gottlob, R. Pichler, and V. Savenkov. Normalization and optimization of schema map-
    pings. Very Large Database J., 20(2):277–302, 2011.
10. E. Kharlamov, M. Giese, E. Jiménez-Ruiz, M. G. Skjæveland, A. Soylu, D. Zheleznyakov,
    T. Bagosi, M. Console, P. Haase, I. Horrocks, S. Marciuska, C. Pinkel, M. Rodriguez-Muro,
    M. Ruzzi, V. Santarelli, D. F. Savo, K. Sengupta, M. Schmidt, E. Thorstensen, J. Trame,
    and A. Waaler. Optique 1.0: Semantic access to big data: The case of Norwegian Petroleum
    Directorate’s FactPages. In Proc. of the ISWC Posters & Demos Track, pages 65–68, 2013.
11. B. Konev, R. Kontchakov, M. Ludwig, T. Schneider, F. Wolter, and M. Zakharyaschev. Con-
    junctive query inseparability of OWL 2 QL TBoxes. In Proc. of the 25th AAAI Conf. on
    Artificial Intelligence (AAAI), 2011.
12. D. Lembo, J. Mora, R. Rosati, D. F. Savo, and E. Thorstensen. Towards mapping analysis
    in ontology-based data access. In Proc. of the 8th Int. Conf. on Web Reasoning and Rule
    Systems (RR), pages 108–123, 2014.
13. A. Poggi, D. Lembo, D. Calvanese, G. De Giacomo, M. Lenzerini, and R. Rosati. Linking
    data to ontologies. J. on Data Semantics, X:133–173, 2008.
14. M. Rodriguez-Muro, R. Kontchakov, and M. Zakharyaschev. Ontology-based data access:
    Ontop of databases. In Proc. of the 12th Int. Semantic Web Conf. (ISWC), 2013.