=Paper= {{Paper |id=Vol-2211/paper-28 |storemode=property |title=Provenance in Ontology-based Data Access |pdfUrl=https://ceur-ws.org/Vol-2211/paper-28.pdf |volume=Vol-2211 |authors=Ana Ozaki,Rafael Peñaloza |dblpUrl=https://dblp.org/rec/conf/dlog/OzakiP18 }} ==Provenance in Ontology-based Data Access== https://ceur-ws.org/Vol-2211/paper-28.pdf
     Provenance in Ontology-based Data Access

                         Ana Ozaki and Rafael Peñaloza

          KRDB Research Centre, Free University of Bozen-Bolzano, Italy


      Abstract. We present an approach for dealing with provenance informa-
      tion in an ontology-based data access (OBDA) setting. Our approach is
      based on provenance semirings, which were studied in database theory as
      an abstract tool to relate the result of a query with the possible matches
      of the query in the data. We investigate the problems of (i) deciding
      whether an ontology annotated with provenance information entails a
      conjunctive query associated with a provenance polynomial, and (ii) com-
      puting a polynomial representing the provenance of a query entailed by
      a provenance annotated ontology. We show that these polynomials may
      be infinite in general. We then study a special case where the semiring
      is idempotent and hence the polynomial is guaranteed to be finite. We
      instantiate these problems to DL-LiteR and provide some complexity
      results.


1    Introduction
Description logics (DLs) [3] have been sucessfully applied to integrate data
coming from multiple and heterogeneous sources. In this setting, commonly
known as ontology-based data access (OBDA), an ontology enriches the data
with background knowledge, providing the user with a high-level conceptual view
of the data, among other beneficial properties [9, 13]. Data integration can be
useful to support users with a convenient vocabulary for queries, and help them
to obtain accurate results.
    Querying over multiple data sources also increases the challenge of establishing
the authorship and of determining the reliability of the data. More importantly,
this challenge transfers to all the (implicit) consequences that can be derived
from the data and the ontology. In this scenario, one may not just be interested
in knowing the result of a query, but also how it was produced, how likely or
reliable the result may be, how many different ways are there to derive it, or how
dependent it is to certain parts of the data, among many other questions [18].
    To address all these issues, provenance semirings were introduced in database
theory [8] as an abstract tool to record and track provenance information; that is,
to keep track of the specific database tuples that are responsible for a derivation,
and additional information associated to them. In this work, we present an
approach based on provenance semirings for dealing with provenance information
in an OBDA setting. Following the approach from database theory, we assume
that facts are annotated with a label and we want to know which combinations
of these labels lead to the entailment of a query. Such information is expressed in
the form of a provenance polynomial, as we illustrate in the following example.
Example 1. Consider the DL assertions

       headGov(Renier, Venice), headGov(Brugnaro, Venice), City(Venice)

being annotated with the sources p, q, and r respectively. Based on these assertions,
the answer to the query ∃xy.(headGov(x, y) ∧ City(y)) can be derived by using
any of the first two assertions (the role assertions) together with the last assertion.
Based on the source annotations, this information can be expressed through the
provenance polynomial (p ⊕ q) ⊗ r.
   In our approach for an OBDA setting, we do not only label the facts in the
ABox, but also the concept inclusions from the TBox. These labels are interpreted
and propagated according to a semiring semantics, as we illustrate in the next
example.
Example 2. Consider the inclusion ∃headGov.> v Mayor annotated with the
source s. Based on this inclusion and the annotated assertions from Example 1,
the answer to ∃x.(Mayor(x)) can be derived by using any of the first two assertions
together with the inclusion. Based on the source annotations, this information
can be expressed through the provenance polynomial (p ⊕ q) ⊗ s.
    We investigate (i) the complexity of deciding whether an ontology annotated
with provenance information entails a conjunctive query associated with a prove-
nance polynomial, and (ii) the problem of computing polynomials representing the
provenance of a query entailed by a provenance annotated ontology. We instanti-
ate these problems to DL-LiteR , the logic underpinning the OWL 2 QL profile [1],
and show that, for the first problem, the complexity remains NP-complete in
combined complexity, and in LogSpace in data complexity (as in the classical
query answering problem in DL-LiteR ). Regarding problem (ii), we show that
the set of polynomials representing provenance may be infinite in general. So
we study a special case where the semiring is multiplicatively-idempotent and
the set of polynomials is guaranteed to be finite. We show that computing the
provenance polynomials can be hard, i.e., there is a DL-LiteR ontology and a
query such that the set of provenance polynomials cannot be represented with
polynomial space.


2    Basic Definitions
Following the original ideas from provenance semirings in databases [8], we repre-
sent the provenance information through a so-called positive algebra provenance
semiring (or provenance semiring for short). Formally, given a set X of variables,
the provenance semiring is the algebra S = (N[X], ⊕, ⊗, 0, 1), where the product
⊗ and the addition ⊕ are two commutative, and associative binary operators,
and the product distributes over the addition. These operators are defined as
usual over the space N[X] of finite polynomials with integer coefficients on the
variables X. We denote by NP the set of all (finite) polynomials over the variables
X that are of expanded form; that is NP contains only polynomials of the form
P         Q
  1≤i≤n    1≤j≤m ai,j , with ai,j ∈ X, and n, m > 0. In words, a polynomial in
expanded form is a finite set of monomials, each formed by a finite product of
variables in X. Notice that distributivity of product over additions entails that
every polynomial can be equivalently rewritten in expanded form; however, the
expanded form of a polynomial may become exponentially larger. For instance,
the polynomial (p ⊕ q) ⊗ r from Example 1 is equivalent to (p ⊗ r) ⊕ (q ⊗ r).

2.1   Provenance Annotated Ontologies
The provenance information for each axiom in an ontology is stored in the form
of an annotation. For the scope of this paper, we focus on ontologies written in
the description logic DL-LiteR [2]. Consider three mutually disjoint countable
sets of concept names NC , role names NR , and individual names NI . A DL-LiteR
concept (resp. role) assertion is an expression of the form A(a) (resp. R(a, b)),
with A ∈ NC (resp. R ∈ NR ) and a, b ∈ NI . DL-LiteR role and concept inclusions
are expressions of the form S v T and B v C, respectively, where S, T are role
expressions and B, C are concepts built according to the grammar rules

       S ::= R | R−        T ::= S | ¬S      B ::= A | ∃S       C ::= B | ¬B,

with R ∈ NR and A ∈ NC . A DL-LiteR axiom is either a DL-LiteR assertion,
role inclusion, or concept inclusion. A provenance annotated DL-LiteR ontology
is a set of annotated axioms of the form (α, p) where α is a DL-LiteR axiom
and p ∈ X ⊆ NP is a variable of a provenance semiring, such that each variable
appears in at most one axiom. We call DL-LitePR this annotated extension of
DL-LiteR .
    The semantics of DL-LitePR is given by interpretations, which extend the
classical interpretations of DL-LiteR to track provenance, when relevant. Formally,
a DL-LitePR interpretation is a triple I = (∆I , ∆Ip , ·I ) where ∆I is a non-empty
set (called the domain of I), ∆Ip is a non-empty set (called the domain of
polynomials of I), and ·I is the interpretation function mapping
 – every p, q ∈ NP to some pI , q I in the set ∆Ip with the condition that pI = q I
   iff the polynomials p and q are mathematically equal (e.g., (p⊗q)I = (q ⊗p)I );
 – every A ∈ NC to some AI ⊆ ∆I × ∆Ip ; and
 – every R ∈ NR to some RI ⊆ ∆I × ∆I × ∆Ip .

We extend the mapping ·I to further DL-LiteR expressions in the natural way:

  (R− )I = {(e, d, pI ) | (d, e, pI ) ∈ RI } , (¬S)I = (∆I × ∆I × ∆Ip ) \ S I ,
  (∃S)I = {(d, pI ) | ∃e ∈ ∆I : (d, e, pI ) ∈ S I } and (¬B)I = (∆I × ∆Ip ) \ B I .

The DL-LitePR interpretation I satisfies: (A(a), p) (respectively (R(a, b), p)) if
(a, pI ) ∈ AI (resp. (a, b, pI ) ∈ RI ); (C v D, p) if, for all q ∈ NP , (d, q I ) ∈ C I
implies that (d, (q ⊗ p)I ) ∈ DI ; and (S v T , p) if (d, e, q I ) ∈ S I implies that
(d, e, (q ⊗ p)I ) ∈ T I . I satisfies an annotated ontology O, in symbols I |= O, if
it satisfies all annotated axioms in O.
Example 3. Consider the ontology O with the assertions of Example 1 and
the inclusion of Example 2. Let I be a DL-LitePR interpretation with domain
∆I = {Renier, Venice, Brugnaro} and ∆Ip containing {p, q, r, s, (p ⊗ s), (q ⊗ s)},
which interprets individual names and polynomials in O by themselves, and
 – headGovI = {(Renier, Venice, p), (Brugnaro, Venice, q)},
 – MayorI = {(Renier, p ⊗ s), (Brugnaro, q ⊗ s)},
 – CityI = {(Venice, r)}.
I is a model of O.

2.2    Provenance Annotated Queries
We extend the notion of conjunctive queries in DLs by allowing binary and
ternary predicates, where the last term of the tuple can either be a variable or
a monomial in NP (by definition of the semantics of DL-LitePR , tuples can only
contain monomials, not sums). More specifically, a Boolean conjunctive query
(BCQ) q is a sentence ∃~x.ϕ(~x, ~a, p~), where ϕ is a conjunction of atoms of the
form A(t1 , t), R(t1 , t2 , t), and ti is either an individual name from ~a, or a variable
from ~x, and t (the last term of each tuple) is either an element of NP in the list
p~ or a variable from ~x. We often write P (~t) to refer to an atom which can be
either A(t1 , t) or R(t1 , t2 , t) and write P (~t) ∈ q if P (~t) is an atom occurring in q.
    A match of the BCQ q = ∃~x.ϕ(~x, ~a, p~) in the DL-LitePR interpretation I is a
function π mapping ~x ∪ ~a ∪ p~ to ∆I ∪ ∆Ip , such that, for all b ∈ ~a ∪ p~, π(b) = bI
and, for every atom P (~t) ∈ q, π(~t) ∈ P I . The interpretation I satisfies the BCQ
q, written I |= q, if there is a match of q in I. A BCQ is entailed by O if it is
satisfied by every model of O. For a BCQ q and an interpretation I, we denote
by νI (q) the set of all matches of q in I. The provenance of q on I, denoted
provI (q), is the (potentially infinite) expression:
                                  X       Y
                                               provI (P (π(~t)))
                             π∈νI (q) P (~
                                         t)∈q

where provI (P (π(~t))) is the last element of the tuple π(~t) ∈ P I . For p ∈ NP ,
we write p ⊆ provI (q) if p is a sum of monomials and for each occurrence of a
monomial in p we find an occurrence of it in provI (q). We say that I satisfies q
with provenance p ∈ NP , in symbols I |= (q, p), if I |= q and p ⊆ provI (q). We
say that a DL-LitePR ontology O entails q, in symbols O |= q if for all DL-LitePR
interpretations I, if I |= O then I |= q; and O |= (q, p) if O |= q and for all
DL-LitePR interpretations I satisfying O we have that p ⊆ provI (q).
    We often write |Y | to denote the size of a DL-LitePR ontology, a polynomial
or a BCQ Y , defined as the length of the string that represents Y , where role
and concept names are considered to be of length one.

2.3    Reasoning Problems
We consider the problem of query entailment w.r.t. a provenance polynomial,
defined as follows: given an ontology O in an ontology language L, a query q and
a polynomial p ∈ NP we want to decide whether O |= (q, p). Another important
and related question is how to compute the provenance of a query. We define the
problem of computing the provenance of a query as follows: given an ontology O
in an ontology language L and a query q, we want to compute the set of all p ∈ NP
such that O |= (q, p). In our formalism, this second problem depends on whether
there is a finite set of polynomials which we can compute. The following example
shows that in DL-LitePR the set of provenance polynomials can be infinite.
Example 4. Consider a DL-LitePR ontology O with the annotated assertions of
Example 1, the annotated inclusion of Example 2 and (Mayor v ∃headGov.>, t).
Then, for all n ∈ N, it holds that O |= (Mayor(Renier), p ⊗ sn+1 ⊗ tn ). Indeed, any
model I of O satisfies the mentioned axioms, so (Renier, (p⊗s)I ) ∈ MayorI implies
(a, (p ⊗ s ⊗ t)I ) ∈ (∃headGov.>)I , which implies (Renier, (p ⊗ s2 ⊗ t)I ) ∈ MayorI ,
and so on.
    In Sections 3 and 4 we consider the problem of query entailment w.r.t. a
provenance polynomial. Notice that in Example 4, if the semiring is multiplica-
tively idempotent (i.e., s ⊗ s = s), then the set of provenance polynomials is
finite; indeed, the only polynomial will be p ⊗ s ⊗ t. This is not a coincidence;
in fact, under multiplicative-idempotency, the set of provenance polynomials is
always finite. The following proposition states that multiplicative-idempotency is
not only necessary (as shown in Example 4) but also sufficient to guarantee that
the set of polynomials is finite.
Proposition 5. Under multiplicative-idempotency, for any satisfiable DL-LitePR
ontology O and BCQ q, the set of p ∈ NP such that O |= (q, p) is finite.

Sketch. Under multiplicative-idempotency, for any DL-LitePR ontology O and
BCQ q, the number of possible monomials occurring p ∈ NP such that O |= (q, p) is
finite. Thus, the only possibility for the set to be infinite is if these monomials can
repeat an unlimited number of times. To entail such polynomials, the arbitrarily
large number of repetitions should happen in all models of O. However, under
multiplicative-idempotency, DL-LitePR enjoys the finite domain property. That is,
if a DL-LitePR ontology has a model then it has a model I with ∆I finite.

   In Section 5 we study the properties gained under idempotent semirings. In
particular, we consider the problem of computing the provenance of a query.


3    Characterization

In this section, we show how to reduce the problem of query entailment w.r.t.
a provenance polynomial to the query entailment problem in standard DLs for
a particular class of provenance annotated DL-LitePR ontologies which we call
marked. We use our results for marked ontologies, in Section 4, to solve the query
answering problem for provenance annotated DL-LitePR ontologies in general. We
say that a DL-LitePR ontology Om is marked if there is a DL-LitePR ontology O
such that Om is the result of:
 – replacing each (R(a, b), p) by (Ra,b (a, b), p), where Ra,b is a fresh role name;
 – for all a, b ∈ NI and all R ∈ NR occurring in O, adding a concept inclusion
       (−)
   (∃Ra,b v C, v) for each (∃R(−) v C, u) ∈ O, where v is fresh;
 – for all a, b ∈ NI and all R, S ∈ NR occurring in O, adding a role inclusion
      (−)      (−)
   (Ra,b v Sa,b , v) for each (R(−) v S (−) , u) ∈ O, where v is fresh.
    Intuitively, we want to ensure that there is a model of Om in which elements
in the anonymous part (i.e., not in the image of NI ) connected (via roles) to the
image of an individual are associated with monomials containing at least one
variable of the semiring which is not shared by anonymous elements connected
to the image of another individual. In other words, we want to ‘mark’ monomials
associated to elements derived from assertions of named individuals.
    We now show that given a marked ontology O, a BCQ q and a polynomial
p ∈ NP , we can translate the query q and the polynomial p into a set of queries
such that O entails (q, p) iff it entails at least one of the queries from this set. We
first show how to translate a BCQ where all terms are variables (no individual
names and no polynomials). For a BCQ q = ∃~x ϕ(~x) with m atoms and p ∈ NP
with n monomials, we define Tr(q, p) as the set of all BCQs:
                                          ^
                                    ∃~y        ϕi (x~i ),                            (1)
                                         1≤i≤n

where ~y = x~1 , . . . , x~n and each qi = ∃x~i ϕi (x~i ) is a ‘copy’ of q in which we replace
each variable x ∈ ~x by a fresh variable xi ∈ x~i . It remains to check whether
we can find the monomials of the polynomial in these matches. We do this by
replacing the last variable in each j-th atom Q         of qi by a monomial pi,j built from
symbols in X ⊆ NP occurring P        in  p such that       1≤j≤m pi,j = pi for some pi ∈ NP ,
with 1 ≤ i ≤ n; and 1≤i≤n pi = p.
   The translation of a BCQ with individual names is done in a similar manner,
except that we also have to add such individual names in each copy of the query,
that is, we would replace the corresponding variable in the translation with the
individual name occurring in the query. Theorem 8 formalises the correctness of
our translation. In the following, we write O |= Tr(q, p) (resp. I |= Tr(q, p)) to
express that there is q 0 ∈ Tr(q, p) such that O |= q 0 (resp. I |= q 0 ).
Example 6. Consider the query q = ∃xyzw.(headGov(x, y, z) ∧ City(y, w)) and
the polynomial p = (s ⊗ t) ⊕ (s ⊗ r). Then,

  ∃x1 y1 x2 y2 .(headGov(x1 , y1 , s) ∧ City(y1 , t) ∧ headGov(x2 , y2 , s) ∧ City(y2 , r))

is in Tr(q, p).
   To show Theorem 8 we use the following technical lemma. whose proof uses
the construction of the canonical model and the assumption that the DL-LitePR
ontology is in marked form.
Lemma 7. Let O be a satisfiable DL-LitePR marked ontology, q a BCQ and p a
polynomial in NP . If O |= (q, p) then for any two monomials p1 , p2 appearing in
p, it holds that p1 and p2 are mathematically distinct.
   With the help of this lemma, we can state the main result of this section;
namely, that the translation for marked ontologies is correct.
Theorem 8. Let O be a DL-LitePR ontology, q a BCQ and p ∈ NP a polynomial
formed of mathematically distinct monomials. Then,

                      O |= (q, p) if, and only if, O |= Tr(q, p).

   In the next section, we use our characterization for marked ontologies to
present a query rewriting method for answering provenance queries.


4    Query Rewriting
We now show how to adapt the classical query rewriting algorithm PerfectRef [7]
to deal with provenance annotated ontologies and queries in DL-LitePR . We use
our characterization in Section 3 for marked ontologies and restrict to the case
where BCQs q are such that, for all P (~t) ∈ q, the last element of ~t is a monomial in
NP (i.e., variables do not occur in the last parameter of an atom). In Theorem 11
we explain how we solve the problem of query entailment w.r.t. a provenance
polynomial for arbitrary DL-LitePR ontologies and annotated queries. Whenever
possible, we use the same definitions and terminology in [7, Section 5.1], adapting
some of them to our setting, when necessary.
    In the following, the symbol “− ” represents non-distinguished non-shared
variables. A positive inclusion I is a provenance annotated role or concept
inclusion without negations. We say that I is applicable to an atom A(x, p) if I
is annotated with v occurring in p and it has A in its right-hand side. A positive
inclusion I is applicable to an atom R(x, y, p) if (i) x =− , I is annotated with
v occurring in p, and the right-hand side of I is ∃R, or (ii) I is a role inclusion
annotated with v occurring in p and its right-hand side is either R or R− .
    To simplify the presentation, for each role R− occurring in a marked DL-
LitePR ontology O, we add to O the annotated role inclusions (R− v R, pR ) and
(R v R− , p0R ), where R is a fresh role name and pR , p0R are fresh variables of a
provenance semiring. We can then assume w.l.o.g. that inverse roles only occur
in such role inclusions by replacing any other occurrence of R− with R. Given a
monomial p ∈ NP and a variable v ∈ X ⊆ NP of the semiring occurring in p, we
denote by p|v the result of removing one occurrence of v from p.
Definition 9. Let g be an atom and I be a positive inclusion that is applicable
to g. Then, the atom obtained from g by applying I, denoted by gr(g, I), is:
 – gr(g, I) = A1 (x, p|v ), if g = A(x, p) and I = (A1 v A, v);
 – gr(g, I) = R(x,− , p|v ), if g = A(x, p) and I = (∃R v A, v);
 – gr(g, I) = A(x, p|v ), if g = R(x,− , p) and I = (A v ∃R, v);
 – gr(g, I) = R1 (x,− , p|v ), if g = R(x,− , p) and I = (∃R1 v ∃R, v);
 – gr(g, I) = R1 (x, y, p|v ), if g = R(x, y, p) and I = (R1 v R, v);
 – gr(g, I) = R1 (y, x, p|v ), if g = R(x, y, p) and either I = (R1 v R− , v) or
   I = (R1− v R, v).
Algorithm 1 Algorithm PerfectRef
Input: a BCQ q, a set of positive inclusions OT
Output: a union of BCQs P R
 1: P R := {q}
 2: repeat
 3:    P R0 := P R
 4:    for all q ∈ P R0 do
 5:       for all g ∈ q do
 6:           for all I ∈ OT do
 7:               if I is applicable to g then
 8:                   P R := P R ∪ {q[g/gr(g, I)]}
 9:       for all g1 , g2 ∈ q do
10:           if g1 and g2 unify then
11:               P R := P R ∪ {τ (reduce(q, g1 , g2 ))}
12: until P R0 = P R
13: return P R


    We use the same algorithm PerfectRef (Algorithm 1) originally presented
in [7], except that the notion of applicability of a positive inclusion I to an atom
g is as previously described and gr(g, I) is as in Definition 9. Let q[g/g 0 ] denote
the BCQ obtained from q by replacing the atom g with a new atom g 0 . Let τ
be a function that takes as input a BCQ q and returns a new BCQ obtained by
replacing each occurrence of an unbound variable in q with the symbol ‘− ’; and
let reduce be a function that takes as input a BCQ q and two atoms g1 , g2 and
returns a BCQ obtained by applying to q the most general unifier between g1
and g2 . We denote by PerfectRef(q, OT ) the output of the algorithm PerfectRef
with a BCQ q (with a monomial in NP in the last parameter of each atom) and
a set OT of positive inclusions of a marked DL-LitePR ontology O as input.
Example 10. Consider a DL-LitePR ontology O containing the annotated asser-
tions of Example 1 and the annotated inclusion I of Example 2. Assume that
Algorithm 1 receives OT and q = ∃x.Mayor(x, p⊗s) as input. Since I is applicable
to g = Mayor(x, p ⊗ s), in Line 8, Algorithm 1 adds to P R the result of replacing
g by gr(g, I) = headGov(x,− , p) in q. Hence we get that

                  q † = ∃x, y headGov(x, y, p) ∈ PerfectRef(q, OT ).

Indeed q † is a rewriting of q.
    Termination of our modified version of PerfectRef follows the same lines
as [7, Lemma 34], except that now the number of terms is exponential in the
size of monomials occurring in the query, and thus, in the size of the query. This
is due to Definition 9, where we ‘break’ the monomial into a smaller one. Our
modification does not change the upper bounds obtained with the algorithm,
since in data complexity only the data is considered as part of the input (so the
query is fixed) and the upper bound for combined complexity, which we establish
in Theorem 11, is obtained by a non-deterministic version of the algorithm.
Theorem 11. In DL-LitePR , answering provenance annotated queries is NP-com-
plete (combined complexity).

    The following theorem states the complexity of answering provenance anno-
tated queries in DL-LitePR when (i) only provenance annotated role and concept
inclusions are considered as the input, and (ii) only the assertions are the input.

Theorem 12. In DL-LitePR , answering provenance annotated queries is (i) in
PTime in the size of the role and concept inclusions, and (ii) in LogSpace in
the size of the assertions (data complexity).

Proof. These complexity results follow from the fact that: (1) satisfiability of
DL-LitePR is in LogSpace, as in DL-LiteR [2]; (2) the query rewriting can be
computed in time polynomial in the number of positive inclusions in a DL-LitePR
ontology O and in constant time in the number of assertions in O; (3) (our
modified version of) PerfectRef is correct [7] (see Theorem 11); and (4) evaluation
of a query over a database can be computed in LogSpace w.r.t. the size of the
database (follows from the fact that queries are a fragment of first-order logic
and their size is fixed for the complexity results in this theorem).


5   Idempotency

We now turn our attention to a special case, where the semiring describing the
provenance of knowledge is multiplicatively-idempotent; that is, the operation ⊗
is such that for every polynomial p ∈ NP , p ⊗ p = p. Such would be the case, for
example, if the provenance refers to the name of the source of the knowledge;
then, having several times the same name does not affect the result. Alternatively,
one can consider access rights to observe certain pieces of knowledge from the
ontology. For the scope of this section, we remove the assumption that the labels
of each of the axioms are unique. That is, while every axiom is still labelled with
one variable from the provenance space, several axioms may share the same label.
This is consistent, again, with the idea of the labels representing the source or
the accessibility level of a piece of knowledge.
     Consider first the problem of computing the provenance of a query, as described
in Section 2. Notice, however, that whenever the semiring is fully idempotent (that
is, p ⊕ p = p also holds), the task can be simplified to the computation of relevant
monomials. More precisely, in this special case we are interested in computing
the set of all monomials p such that O |= (q, p). Clearly, the polynomial of the
query is the result of adding all these monomials. This definition is equivalent to
the general one since the semiring is idempotent: repetitions of the monomials
do not affect the result, and repetitions of a variable in a given monomial can
be removed. If the semiring is only multiplicatively idempotent, then computing
monomials does not suffice, since some of them may need to appear several times.
However, the problem is still simplified to find the (finite) number of repeated
monomials to be observed.
    The first thing that we notice is that, in general, we cannot avoid an exponen-
tial runtime of any method computing this polynomial, since it may require the
addition of exponentially many monomials, as shown in the following example.
Example 13. Consider the DL-LitePR ontology O containing the axioms:
      (A v B1 , x),      (A v C1 , x),        (Bn v D, x),        (Cn , v D, x),     (A(a), p),
 (Bi v Bi+1 , xi ),   (Bi v Ci+1 , yi ),   (Ci v Bi+1 , xi ),   (Ci v Ci+1 , yi ),   1 ≤ i < n,
                                                                        Qn
and the simple query q = D(a). Notice that every monomial p = x ⊗ i=1 zi ,
where each zi ∈ {xi , yi } is such that O |= (q, p) (and none other). Hence, the
polynomial of the query q is formed by the sum of 2n different monomials; that
is, it is exponential on the size of the ontology.
    Perhaps more interesting, though, is that the provenance of some queries can-
not be expressed through a provenance polynomial whose length is polynomial on
the size of the ontology, even if we allow the expression to not be in expanded form.
This is, in fact, a direct consequence of the results by Karchmer and Wigderson
on monotone complexity [11, 12]. In a nutshell, Karchmer and Wigderson show
that there is no monotone Boolean formula of polynomial length that can express
all the paths between two nodes in a graph.1 In fact, the result holds already for
the special case of a complete graph. As hinted in Example 13, graphs can be
described in DL-LiteR (and in even simpler logics) using basic inclusion axioms.
Moreover, monotone Boolean formulas can be seen as a provenance polynomial
over an idempotent semiring; indeed, the conjunction ∧ serves as product and ∨
as the addition in this algebra, and both operators are idempotent. Hence we
have the following result (see also [14]).

Proposition 14. There exists a DL-LiteR ontology O and a query q such that
the provenance polynomial of q w.r.t. O cannot be represented in polynomial
space. The result holds even if the semiring is idempotent, and every axiom in O
has a unique label.

Proof. Let N be a set with n concept names such that A, B ∈ N . Define the
ontology
               O = {(C v D, xCD ) | C, D ∈ N } ∪ {A(a), xA },
and the query q = B(a). Notice that O simulates a complete graph over the
nodes in N . Every derivation of B(a) represents a path from A to B in this graph.
Hence, if the provenance of q could be expressed with polynomial space, there
would be a monotone Boolean formula representing all the paths from A to B in
the complete graph with n nodes, contradicting the results from [11, 12].

   If in addition we assume as in the previous sections that every axiom has
a unique variable as a provenance label, then it is possible to show that the
provenance polynomial for instance queries can be computed efficiently, whenever
1
    A monotone Boolean formula is a propositional formula built using only the connec-
    tives ∧ and ∨ (without negations).
its length does not increase greatly; that is, it can be computed in polynomial
time on the size of the input and the output. The proof of the following lemma
follows the same ideas presented in [15, 16], based on the fact that all the relevant
monomials that form the provenance can be enumerated with polynomial delay.

Lemma 15. The provenance p for an instance query w.r.t. the ontology O can
be computed in polynomial time on the size of O and the polynomial p.

    All the results presented in this section refer to the complexity obtained from
considering the whole ontology as an input. In fact, more precise complexity
analyses based on fixing some of the parameters are left for future work. Moreover,
most of the queries used were simple instance or Boolean queries. In particular,
this means that all the hardness results can be transferred directly to more
complex (conjunctive) queries. On the other hand, it is worth noting that the
upper bounds developed in the previous section for the general case, obviously
apply also in the restricted setting of idempotency.


6   Conclusions

In this paper, we have studied the problem of provenance in an OBDA scenario
where the data and the ontological axioms are annotated with provenance infor-
mation. In particular, we have shown that query rewriting techniques developed
for the classical OBDA setting can be adapted to handle provenance as well. Using
this argument, we were able to study the complexity of finding the provenance of
a query. We also studied special cases in which the operators from the provenance
semiring are idempotent.
    It is important to emphasise that, despite some apparent similarities, the
problem of provenance is very different from that of axiom pinpointing [6, 10, 17].
Although both in problems we are interested in tracing the causes of a consequence,
axiom pinpointing focuses on those that use minimal sets of axioms. In contrast,
all possible derivations are relevant for provenance, independently of whether
a subset of axioms might already suffice. Hence, although the set of monotone
Boolean formulas is a semiring, the provenance polynomial over this semiring
does not necessarily coincide with the pinpointing formula [4,5]. The main reason
for this difference is that the provenance label represents a key which can be
used to encode other information or values, such as trust or probability; hence
each derivation contributes to the final value represented by the polynomial.
    As future work, we plan to study the properties of provenance-based OBDA
in more detail. In particular, we are interested in understanding how the choice of
the semiring affects the computational complexity of answering different problems,
and obtain a more fine-grained analysis of the complexity results. We also plan
to investigate our approach for dealing with provenance information in more
expressive logics.
References
 1. OWL 2 Web Ontology Language Profiles (Second Edition). Technical report, W3C,
    2012.
 2. Alessandro Artale, Diego Calvanese, Roman Kontchakov, and Michael Za-
    kharyaschev. The DL-Lite family and relations. Journal of artificial intelligence
    research, 36(1):1–69, 2009.
 3. Franz Baader, Diego Calvanese, Deborah McGuinness, Daniele Nardi, and Peter
    Patel-Schneider, editors. The Description Logic Handbook: Theory, Implementation,
    and Applications. Cambridge University Press, second edition, 2007.
 4. Franz Baader and Rafael Peñaloza. Automata-based axiom pinpointing. J. Autom.
    Reasoning, 45(2):91–129, 2010.
 5. Franz Baader and Rafael Peñaloza. Axiom pinpointing in general tableaux. J. Log.
    Comput., 20(1):5–34, 2010.
 6. Franz Baader, Rafael Peñaloza, and Boontawee Suntisrivaraporn. Pinpointing in
    the description logic EL+ . In KI, pages 52–67, 2007.
 7. Diego Calvanese, Guiseppe De Giacomo, Domenico Lembo, Maurizio Lenzerini, and
    Riccardo Rosati. Tractable reasoning and efficient query answering in description
    logics: The DL-Lite family. J. of Automated Reasoning, 39(3):385–429, 2007.
 8. Todd J. Green, Gregory Karvounarakis, and Val Tannen. Provenance semirings.
    In Proceedings of the Twenty-Sixth ACM SIGACT-SIGMOD-SIGART Symposium
    on Principles of Database Systems, June 11-13, 2007, Beijing, China, pages 31–40,
    2007.
 9. Stijn Heymans, Li Ma, Darko Anicic, Zhilei Ma, Nathalie Steinmetz, Yue Pan, Jing
    Mei, Achille Fokoue, Aditya Kalyanpur, Aaron Kershenbaum, Edith Schonberg,
    Kavitha Srinivas, Cristina Feier, Graham Hench, Branimir Wetzstein, and Uwe
    Keller. Ontology reasoning with large data repositories. In Ontology Management,
    2008.
10. Aditya Kalyanpur, Bijan Parsia, Matthew Horridge, and Evren Sirin. Finding all
    justifications of OWL DL entailments. In ISWC, pages 267–280, 2007.
11. M. Karchmer and A. Wigderson. Monotone circuits for connectivity require super-
    logarithmic depth. SIAM Journal on Discrete Mathematics, 3(2):255–265, 1990.
12. Mauricio Karchmer and Avi Wigderson. Monotone circuits for connectivity require
    super-logarithmic depth. In Proceedings of the Twentieth Annual ACM Symposium
    on Theory of Computing, STOC ’88, pages 539–550, New York, NY, USA, 1988.
    ACM.
13. Roman Kontchakov, Mariano Rodríguez-Muro, and Michael Zakharyaschev.
    Ontology-based data access with databases: A short course. In Proceedings of
    the 9th International Conference on Reasoning Web: Semantic Technologies for
    Intelligent Data Access, RW’13, pages 194–229, 2013.
14. Rafael Peñaloza. Axiom pinpointing in description logics and beyond. PhD thesis,
    Dresden University of Technology, 2009.
15. Rafael Peñaloza. Inconsistency-tolerant instance checking in tractable description
    logics. In Stefania Costantini, Enrico Franconi, William Van Woensel, Roman
    Kontchakov, Fariba Sadri, and Dumitru Roman, editors, Proceedings of the Inter-
    national Joint Conference on Rules and Reasoning (RuleML+RR 2017), volume
    10364 of Lecture Notes in Computer Science, pages 215–229. Springer, 2017. doi:
    https://doi.org/10.1007/978-3-319-61252-2_15.
16. Rafael Peñaloza and Barış Sertkaya. Understanding the complexity of axiom
    pinpointing in lightweight description logics. Artificial Intelligence, 250:80–104,
    September 2017. doi: https://doi.org/10.1016/j.artint.2017.06.002.
17. Stefan Schlobach and Ronald Cornet. Non-standard reasoning services for the
    debugging of description logic terminologies. In Proceedings of the 18th Interna-
    tional Joint Conference on Artificial Intelligence, IJCAI’03, pages 355–360. Morgan
    Kaufmann Publishers Inc., 2003.
18. Pierre Senellart. Provenance and probabilities in relational databases. SIGMOD
    Record, 46(4):5–15, 2017.