Ontology-Based Access to Probabilistic Data

                     Jean Christoph Jung and Carsten Lutz

                          Universität Bremen, Germany
                   {jeanjung,clu}@informatik.uni-bremen.de


      Abstract. We propose a framework for querying probabilistic data in
      the presence of an ontology, arguing that the interplay of probabilities
      and ontologies is fruitful in applications such as managing data that
      was extracted from the web. The prime inference problem is computing
      answer probabilities, and we show that it can be implemented using
      standard probabilistic database systems, similar to traditional ontology-
      based data access. We demonstrate that query rewriting into first-order
      logic is an important tool for our framework. First, it is used to establish
      a PTime vs. #P dichotomy for the data complexity of this problem by
      lifting a corresponding result from probabilistic databases. Then, we use
      it to characterize which pairs of query and TBox are in PTime. Finally,
      it is shown that non-existence of such a rewriting implies #P-hardness.


1   Introduction
In recent years, ontology based data access (OBDA) has become an active area of
description logic research. In OBDA, an ontology provides a semantics for incom-
plete data with the aim of facilitating the computation of more complete answers
to queries. There are applications, though, in which it is necessary to query data
that is not only incomplete, but also uncertain. For instance, data extracted
from web sources [24] such as an estate agents’ web page is typically incomplete.
It is also uncertain because web sources tend to be unreliable and extraction
tools are based on heuristic decisions and thus significantly error prone. In this
paper, we propose an extension of OBDA that captures uncertain data through
a probabilistic data model and replaces the computation of certain answers with
computing the probabilities of certain answers. In brief, our approach relates
to probabilistic database systems (PDBMSs) in the same way that traditional
OBDA relates to RDBMSs.
Framework. We consider probabilistic ABox formalisms that are inspired by
data models from the currently very active area of probabilistic databases [7,32].
Specifically, pABoxes enrich classical ABoxes with probabilities that are attached
to probabilistic events, and with event expressions that are attached to ABox
assertions. For example, a pABox assertion SoccerPlayer(messi) can be associ-
ated with an event expression e1 ∨ e2 , where e1 and e2 represent events such
as ‘web extraction tool x correctly analyzed webpage y stating that Messi is
a soccer player’. Event e1 can then be associated with probability .7, and e2
with .9. Events are assumed to be probabilistically independent, which results
2                                Jean Christoph Jung and Carsten Lutz

in a straightforward semantics that is similar to well-known probabilistic versions
of datalog [28,12]. Ontologies are represented by description logic TBoxes. In this
setting, which we call ontology based access to probabilistic data (pOBDA), we
are interested in computing the probabilities of certain answers to conjunctive
queries (CQs). Note that uncertainty occurs only in the data, but neither in the
ontology nor in the query. We believe that pOBDA is of general interest and
potentially useful for a wide range of applications including the management of
data extracted from the web, machine translation, and dealing with data that
arises from sensor networks. All these applications can potentially benefit from a
fruitful interplay between ontologies and probabilities; in particular, the ontology
can help to reduce the uncertainty of the data.

Contributions. The main aim of this paper is to study the data complexity of
pOBDA. More precisely, we pursue a non-uniform approach as recently initiated
in [27], which aims at fully classifying the data complexity of every pair (q, T )
that consists of a CQ q and a TBox formulated in a fixed ‘master DL’. As a
central tool of our analysis, we use query rewriting into first-order (FO) queries,
which is an important technique for traditional OBDA [6]. We start with showing
that FO-rewritings from traditional OBDA are useful also in the context of
pOBDA: for any pABox A, the probability that a tuple a is a certain answer to
q over a pABox A relative to T is identical to the probability that a is an answer
to the FO-rewriting qT of q and T over A viewed as a probabilistic database.
This lifting of FO-rewritings to the probabilistic case immediately implies that
one can implement pOBDA based on existing PDBMSs such as MayBMS, Trio,
and MystiQ [1,35,5].
    Lifting also allows us to carry over the dichotomy between PTime and #P-
hardness for computing the probabilities of answers to unions of conjunctive
queries (UCQs) over probabilistic databases recently obtained by Dalvi, Suciu,
and Schnaitter [8] to our pOBDA framework, provided that we restrict ourselves
to TBoxes formulated in (the core version of) DL-Lite and to ipABoxes, which
are a special case of pABoxes in which all ABox assertions are probabilistically
independent. Based on a careful syntactic analysis, we provide a concrete char-
acterization of those CQs q and DL-Lite TBoxes T for which computing answer
probabilities is in PTime. We then proceed to showing that query rewriting
is a complete tool for proving PTime data complexity in pOBDA, in the fol-
lowing sense: we replace DL-Lite with the strictly more expressive description
logic ELI where, in contrast to DL-Lite, rewritings into first-order queries do
not exist for every CQ q and TBox T ; we then prove that if some (q, T ) does
not have a rewriting, then computing answer probabilities for q relative to T
is #P-hard. Thus, if it is possible at all to prove that some (q, T ) has PTime
data complexity, then this can always be done using query rewriting. Both in
DL-Lite and ELI, the class of queries and TBoxes with PTime data complexity
is relatively small. This negative result is relativized by the fact that answer
probabilities can often be efficiently approximated. In particular, all pairs (q, T )
admit approximation in terms of a fully polynomial randomized approximation
scheme (FPRAS) whenever q is FO-rewritable relative to T . We provide a brief
                                Ontology-Based Access to Probabilistic Data        3

discussion of such approximations and refer to the full version [19] of this paper
for more information.
Related Work. The probabilistic ABox formalism studied in this paper is inspired
by the probabilistic database models in [9], but can also be viewed as a variation
of probabilistic versions of datalog and Prolog, see [28,12] and references therein.
There have recently been other approaches to combining ontologies and uncer-
tainty for data access [11,14], with a different semantics; the setup considered
by Gottlob, Lukasiewicz, and Simari in [14] is close in spirit to the framework
studied here, but also allows probabilities in the TBox and has a different, rather
intricate semantics based on Markov logic. In fact, we deliberately avoid proba-
bilities in the ontology because (i) this results in a simple and fundamental, yet
useful formalism that still admits a very transparent semantics and (ii) it enables
the use of standard PDBMSs for query answering in the presence of ontologies.
Note that the analogous property of being implementable using state-of-the-art
RDBMSs is a favorable feature of traditional OBDA [6]. There has also been
a large number of proposals for enriching description logic TBoxes (instead of
ABoxes) with probabilities, see [25,26] and the references therein. Our running
application example is web data extraction, in the spirit of [16] to store extracted
web data in a probabilistic database. Note that it has also been proposed to inte-
grate both probabilities and ontologies directly into the data extraction tool [13].
We believe that both approaches can be useful and could even be orchestrated
to play together.
   This paper is a condensed version of [19]. For the missing proofs and some
additional material we refer the reader to the long version of [19].


2   Preliminaries

We use standard notation for the syntax and semantics of description logics
(DLs) and refer to [3] for full details. As usual, C, D denote (potentially) com-
posite concepts, A, B concept names, r, s role names, R and S role names or
their inverse, and a, b individual names. When R = r− , then R− denotes r. We
consider the ontology languages DL-Lite and ELI. Regarding the former, we
concentrate on the dialect DL-Litecore , where TBoxes are finite sets of concept
inclusions (CIs) B v B 0 and B u B 0 v ⊥ with B and B 0 concepts of the form
∃r, ∃r− , > or A. In ELI, a TBox is a finite set of CIs C v D where C and D are
(potentially) compound concepts of the form >, A, C 0 u D0 , ∃r.C 0 , and ∃r− .C 0 .
     As usual, an ABox is a finite set of concept assertions A(a) and role asser-
tions r(a, b). We use Ind(A) to denote the set of individual names used in the
ABox A and write r− (a, b) ∈ A for r(b, a) ∈ A. The semantics of DLs is based
on interpretations I = (∆I , ·I ); we adopt the unique name assumption (UNA),
i.e., we require aI 6= bI for all individuals a 6= b.
     Conjunctive queries (CQs) take the form ∃y.ϕ(x, y), with ϕ a conjunction
of atoms of the form A(t) and r(t, t0 ) and where x, y denote (tuples of) variables
and t, t0 denote terms, i.e., a variable or an individual name. We call the variables
4                                 Jean Christoph Jung and Carsten Lutz

in x the answer variables and those in y the quantified variables. The set of all
variables in a CQ q is denoted by var(q) and the set of all terms in q by term(q).
A CQ q is n-ary if it has n answer variables and Boolean if it is 0-ary. Whenever
convenient, we treat a CQ as a set of atoms and sometimes write r− (t, t0 ) ∈ q
instead of r(t0 , t) ∈ q.
    Let I be an interpretation and q a CQ with answer variables x1 , . . . , xk .
For individual names a = a1 · · · ak , an a-match for q in I is a mapping π :
term(q) → ∆I such that π(xi ) = ai for 1 ≤ i ≤ k, π(a) = aI for all individual
names a ∈ term(q), π(t) ∈ AI for all A(t) ∈ q, and (π(t1 ), π(t2 )) ∈ rI for all
r(t1 , t2 ) ∈ q. We write I |= q[a] if there is an a-match of q in I and let ans(q, I)
denote the set of all a with I |= q[a]. For a TBox T and an ABox A, we write
T , A |= q[a] if I |= q[a] for all models I of T and A. The set certT (q, A) of all
certain answers consists of all tuples a over Ind(A) with T , A |= q[a].


3    Probabilistic OBDA
We introduce our framework for probabilistic OBDA, starting with the definition
of a rather general, probabilistic version of ABoxes. Let E be a countably infinite
set of atomic (probabilistic) events. An event expression is built up from atomic
events using the Boolean operators ¬, ∧, ∨. We use expr(E) to denote the set
of all event expressions over E. A probability assignment for E ⊆ E is a map
E → [0, 1].
Definition 1 (pABox). A probabilistic ABox (pABox) is of the form (A, e, p)
with A an ABox, e a map A → expr(E), and p a probability assignment for EA ,
the atomic events in A.
Example 1. We consider as a running example an information extraction tool
that is gathering data from the web, see [16] for a similar setup. Assume we are
gathering data about soccer players and the clubs they play for in the current
2012 season, and we want to represent the result as a pABox.
(1) The tool processes a newspaper article stating that ‘Messi is the soul of the
Argentinian national soccer team’. Because the exact meaning of this phrase
is unclear (it could refer to a soccer player, a coach, a mascot), it generates
the assertion Player(messi) associated with the atomic event expression e1 with
p(e1 ) = 0.7. The event e1 represents that the phrase was interpreted correctly.
(2) The tool finds the Wikipedia page on Lionel Messi, which states that he is
a soccer player. Since Wikipedia is typically reliable and up to date, but not
always correct, it updates the expression associated with Player(messi) to e1 ∨ e2
and associates e2 with p(e2 ) = 0.95.
(3) The tool finds an HTML table on the homepage of FC Barcelona saying
that the top scorers of the season are Messi, Villa, and Pedro. It is not stated
whether the table refers to the 2011 or the 2012 season, and consequently we
generate the ABox assertions playsfor(x, FCbarca) for x ∈ {messi, villa, pedro} all
associated with the same atomic event expression e3 with p(e3 ) = 0.5. Intuitively,
the event e3 is that the table refers to 2012.
                                Ontology-Based Access to Probabilistic Data         5

(4) Still processing the table, the tool applies the background knowledge that
top scorers are typically strikers. It generates three assertions Striker(x) with
x ∈ {messi, villa, pedro}, associated with atomic events e4 , e04 , and e004 . It sets
p(e4 ) = p(e04 ) = p(e004 ) = 0.8. The probability is higher than in (3) since being
a striker is a more stable property than playing for a certain club, thus this
information does not depend so much on whether the table is from 2011 or 2012.
(5) The tool processes the twitter message ‘Villa was the only one to score a
goal in the match between Barca and Real’. It infers that Villa plays either for
Barcelona or for Madrid, generating the assertions playsfor(villa, FCbarca) and
playsfor(villa, realmadrid). The first assertion is associated with the event e5 , the
second one with ¬e5 . It sets p(e5 ) = 0.5.

   We now definte the semantics of OBDA over pABoxes. Each E ⊆ EA can be
viewed as a truth assignment that makes all events in E true and all events in
EA \ E false.
Definition 2. Let (A, e, p) be a pABox. For each E ⊆ EA , define a corre-
sponding non-probabilistic ABox AE := {α ∈ A | E |= e(α)}. The function p
represents a probability distribution on 2EA , by setting for each E ⊆ EA :
                                Y          Y
                       p(E) =       p(e) ·       (1 − p(e)).
                                e∈E        e∈EA \E

The probability of an answer a ∈ Ind(A)n to an n-ary conjunctive query q over
a pABox A and TBox T is
                                         X
                    pA,T (a ∈ q) =                  p(E).
                                      E⊆EA : a∈certT (q,AE )

For Boolean CQs q, we write p(A, T |= q) instead of pA,T (() ∈ q), where ()
denotes the empty tuple.

Example 2. Consider again the web data extraction example discussed above.
To illustrate how ontologies can help to reduce uncertainty, we use the DL-Lite
TBox
             T = { ∃playsfor v Player            Player v ∃playsfor
                    ∃playsfor− v SoccerClub          Striker v Player   }
Consider the following subcases considered above.
(1) + (3) The resulting pABox comprises the following assertions with associated
event expressions:

                     Player(messi)     e1 playsfor(messi, FCbarca)      e3
           playsfor(villa, FCbarca)    e3 playsfor(pedro, FCbarca)      e3

with p(e1 ) = 0.7 and p(e3 ) = 0.5. Because of the statement ∃playsfor v Player,
using T (instead of an empty TBox) increases the probability of messi to be an
answer to the query Player(x) from 0.7 to 0.85.
6                                    Jean Christoph Jung and Carsten Lutz

(5) The resulting pABox is

          playsfor(villa, FCbarca)      e5 playsfor(villa, realmadrid)       ¬e5

with p(e5 ) = 0.5. Although Player(villa) does not occur in the data, the proba-
bility of villa to be an answer to the query Player(x) is 1 (again by the TBox-
statement ∃playsfor v Player).
(3)+(4) This results in the pABox

                playsfor(messi, FCbarca)       e3 Striker(messi)      e4
                  playsfor(villa, FCbarca)     e3  Striker(villa)     e04
                playsfor(pedro, FCbarca)       e3 Striker(pedro)      e004

with p(e3 ) = 0.5 and p(e4 ) = p(e04 ) = p(e004 ) = 0.8. Due to the last three CIs in T ,
each of messi, villa, pedro is an answer to the CQ ∃y.playsfor(x, y)∧SoccerClub(y)
with probability 0.9.

Related Models in Probabilistic Databases. Nowadays, there is an abundance of
probabilistic data models that provide compact representation of distributions
over potentially large sets of possible worlds, see [15,30,2] and the references
therein. Our pABoxes can be viewed as an open world version of the probabilistic
data model studied by Dalvi and Suciu in [9]. It is as a less succinct version of c-
tables, a traditional data model for probabilistic databases due to Imielinski and
Lipski [18]. Since we are working with an open world semantics, pABoxes instead
represent a distribution over possible world descriptions. Each such description
may have any number of models. Note that our semantics is similar to the
semantics of (“type 2”) probabilistic first-order and description logics [17,26].
Dealing with Inconsistencies. Of course, some of the ABoxes AE might be incon-
sistent w.r.t. the TBox T used. In this case, it may be undesirable to let them
contribute to the probabilities of answers. For example, if we use the pABox

                      Striker(messi)      e1 Goalie(messi)      e2

with p(e1 ) = 0.8 and p(e2 ) = 0.3 and the TBox Goalie u Striker v ⊥, then
messi is an answer to the query SoccerClub(x) with probability 0.24 while one
would probably expect it to be zero (which is the result when the empty TBox is
used). We follow Antova, Koch, and Olteanu and advocate a pragmatic solution
based on rescaling [2]. More specifically, we remove those ABoxes AE that are
inconsistent w.r.t. T and rescale the remaining set of ABoxes so that they sum
up to probability one. In other words, we set
                                       pA,T (a ∈ q) − p(A, T |= ⊥)
                    pbA,T (a ∈ q) =
                                             1 − p(A, T |= ⊥)
where ⊥ is a Boolean query that is entailed exactly by those ABoxes A that are
inconsistent w.r.t. T . The rescaled probability pbA,T (a ∈ q) can be computed in
PTime when this is the case both for pA,T (a ∈ q) and p(A, T |= ⊥). Note that
                              Ontology-Based Access to Probabilistic Data      7

rescaling results in some effects that might be unexpected such as reducing the
probability of messi to be an answer to Striker(x) from 0.8 to ≈0.74 when the
above TBox is added.
    In the remainder of the paper, for simplicity we will only admit TBoxes T
such that all ABoxes A are consistent w.r.t. T .


4     Query Rewriting
The aim of this section is to show that FO-rewriting, a prominent approach
to traditional OBDA, are fruitful also in the case of computing probabilities of
certain answers in probabilistic OBDA. In particular, we use it to lift the PTime
vs. #P dichotomy result on probabilistic databases recently obtained by Dalvi,
Suciu, and Schnaitter [8] to probabilistic OBDA in DL-Lite.

4.1   Lifting FO-Rewritings to probabilistic OBDA
A first-order query qT is an FO-rewriting of a CQ q and an FO-TBox T (i.e., a
first-order theory) if certT (q, A) = ans(qT , IA ) for every ABox A, where IA to
denotes the ABox A viewed as an interpretation. The query rewriting approach
to traditional OBDA consists of computing, given a CQ q and a TBox T , an
FO-rewriting qT and then handing it over for execution to a relational database
system that stores the ABox A.
    The following observation states that FO-rewritings from traditional OBDA
are also useful in probabilistic OBDA. We use pdA (a ∈ q) to denote the prob-
ability that a is an answer to the query q given the pABox A viewed as a
probabilistic database in the sense of Dalvi and Suciu [8]. More specifically,
                                             X
                        pdA (a ∈ q) =                     p(E)
                                   E⊆EA | a∈ans(q,IAE )

The following is immediate from the definitions.
Theorem 1 (Lifting). Let T be an FO-TBox, A a pABox, q an n-ary CQ,
a ∈ Ind(A)n a candidate answer for q, and qT an FO-rewriting of q relative
to T . Then pA,T (a ∈ q) = pdA (a ∈ qT ).
From an application perspective, Theorem 1 enables the use of probabilistic
database systems such as MayBMS, Trio, and MystiQ for implementing proba-
bilistic OBDA [1,35,5]. Note that it might be necessary to adapt pABoxes in an
appropriate way in order to match the data models of these systems. However,
such modifications do not impair applicability of Theorem 1.
    From a theoretical viewpoint, Theorem 1 establishes query rewriting as a
useful tool for analyzing data complexity in probabilistic OBDA. We say that
a CQ q is in PTime relative to a TBox T if there is a polytime algorithm
that, given an ABox A and a candidate answer a ∈ Ind(A)n to q, computes
pA,T (a ∈ q). We say that q is #P-hard relative to T if the afore mentioned
8                                Jean Christoph Jung and Carsten Lutz

problem is hard for the counting complexity class #P [33]. We pursue a non-
uniform approach to the complexity of query answering in probabilistic OBDA,
as recently initiated in [27]: ideally, we would like to understand the precise
complexity of every CQ q relative to every TBox T , against the background of
some preferably expressive ‘master logic’ used for T .
   Unsurprisingly, pABoxes are too strong a formalism to admit any tractable
queries worth mentioning. An n-ary CQ q is trivial for a TBox T iff for every
ABox A, we have certT (A, q) = Ind(A)n .
Theorem 2. Over pABoxes, every CQ q is #P -hard relative to every first-order
TBox T for which it is nontrivial.
Theorem 2 motivates the study of more lightweight probabilistic ABox for-
malisms. While pABoxes (roughly) correspond to c-tables, which are among
the most expressive probabilistic data models, we now move to the other end
of the spectrum and introduce ipABoxes as a counterpart of tuple independent
databases [9,12]. Argueably, the latter are the most inexpressive probabilistic
data model that is still useful.
Definition 3 (ipABox). An assertion-independent pABox (or ipABox) is a
probabilistic ABox in which all event expressions are atomic and where each
atomic event expression is associated with at most one ABox assertion.
To save notation, we write ipABoxes in the form (A, p) where A is an ABox and
p is a map A → [0, 1] that assigns a probability to each ABox assertion. In this
representation, the events are implicit (one atomic event per ABox assertion) and
we write p(α) to denote the probability of the event associated with the assertion
α. Analogously to pABoxes, all events (and thus all assertions) are independent.
Note that the answer probability of a ∈ Ind(A)n to an n-ary conjunctive query
q over an ipABox (A, p) relative to a TBox T simplifies to
                                             X
                     pA,T (a ∈ q) =                     p(A0 )
                                    A0 ⊆A : a∈certT (q,A0 )
          0
               Q             Q
where p(A ) = α∈A0 p(α) · α∈A\A0 (1 − p(α)).
    Reconsidering our web data extraction example, it turns out that cases (1)
and (4) yield ipABoxes, whereas cases (2), (3), and (5) do not. We refer to [32]
for a discussion of the usefulness of ipABoxes/tuple independent databases. For
the remainder of the paper, we assume that only ipABoxes are admitted unless
explicitly noted otherwise.

4.2   Lifting the PTime vs. #P Dichotomy
We now use Theorem 1 to lift a PTime vs. #P dichotomy recently obtained in
the area of probabilistic databases to probabilistic OBDA in DL-Lite. Note that,
for any CQ and DL-Lite TBox, an FO-rewriting is guaranteed to exist [6]. The
central observation is that, by Theorem 1, computing the probability of answers
to a CQ q relative to a TBox T over ipABoxes is exactly the same problem as
computing the probability of answers to qT over (ipABoxes viewed as) tuple
                                    Ontology-Based Access to Probabilistic Data     9


                        r                       s                s        s
                    A                      t    A
                   s          t0
                                                r                     r
                          t
                     q1                        q2                    q3

                                   Fig. 1. Example queries

independent databases. We can thus analyze the complexity of CQs/TBoxes
over ipABoxes by analyzing the complexity of their rewritings. In particular,
standard rewriting techniques produce for each CQ and DL-Lite TBox an FO-
rewriting that is a union of conjunctive queries (a UCQ) and thus, together with
Theorem 1, Dalvi, Suciu and Schnaitter’s PTime vs. #P dichotomy for UCQs
over tuple independent databases [8] immediately yields the following.
Theorem 3 (Abstract Dichotomy). Let q be a CQ and T a DL-Lite TBox.
Then q is in PTime relative to T or q is #P-hard relative to T .
Note that Theorem 3 actually holds for every DL that enjoys FO-rewritability.
Although interesting from a theoretical perspective, Theorem 3 is not fully satis-
factory as it does not tell us which CQs are in PTime relative to which TBoxes.
In the remainder of this section, we carry out a careful inspection of the FO-
rewritings obtained in our framework and of the dichotomy result obtained by
Dalvi, Suciu and Schnaitter, which results in a more concrete formulation of the
dichotomy stated in Theorem 3 and provides a transparent characterization of
the PTime cases. For simplicity, we concentrate on CQs that are connected,
Boolean, and do not contain individual names.
    For two CQs q, q 0 and a TBox T , we say that q T -implies q 0 and write q vT q 0
when certT (q, A) ⊆ certT (q 0 , A) for all ABoxes A; we say that q and q 0 are T -
equivalent and write q ≡T q 0 if q vT q 0 and q 0 vT q; we say that q is T -minimal
if there is no q 0 ( q such that q ≡T q 0 . When T is empty, we simply drop it from
the introduced notation, writing for example q v q 0 and speaking of minimality.
To have more control over the effect of the TBox, we will generally work with
CQs q and TBoxes T such that q is T -minimal. This is without loss of generality
because for every CQ q and TBox T , we can find a CQ q 0 that is T -minimal
and such that q ≡T q 0 [4]; note that the answer probabilities relative to T are
identical for q and q 0 .
    We now introduce a class of queries that will play a crucial role in our analysis.
Definition 4 (Simple Tree Queries). A CQ q is a simple tree if there is a
variable xr ∈ var(q) that occurs in every atom in q, i.e., all atoms in q are of
the form A(xr ), r(xr , y), or r(y, xr ) (y = xr is possible). Such a variable xr is
called a root variable.
As examples, consider the CQs in Figure 1, which are all simple tree queries.
The following result shows why simple tree queries are important. A UCQ qb is
reduced if for all disjuncts q, q 0 of qb, q v q 0 implies q = q 0 .
10                               Jean Christoph Jung and Carsten Lutz

Theorem 4. Let q be a CQ and T a DL-Lite TBox such that q is T -minimal
and not a simple tree query. Then q is #P-hard relative to T .
Proof. (sketch) Let qT be a UCQ that is an FO-rewriting of q relative to T . By
definition of FO-rewritings, we can w.l.o.g. assume that q occurs as a disjunct
of qT . The following is shown in [8]:
 1. if a minimal CQ does not contain a variable that occurs in all atoms, then
    it is #P-hard over tuple independent databases;
 2. if a reduced UCQ qb contains a CQ that is #P-hard over tuple independent
    databases, then qb is also hard over tuple independent databases.
Note that since q is T -minimal, it is also minimal. By Points 1 and 2 above,
it thus suffices to show that qT can be converted into an equivalent reduced
UCQ such that q is still a disjunct, which amounts to proving that there is no
disjunct q 0 in qT such that q v q 0 and q 0 6v q. The details of the proof, which is
surprisingly subtle, are given in the appendix.                                    o
   To finish anlyzing the dichotomy, it thus remains to analyze simple tree
queries. We say that a role R is T -generated in a CQ q if one of the following
holds: (i) there is an atom R(xr , y) ∈ q and y 6= xr ; (ii) there is an atom
A(xr ) ∈ q and T |= ∃R v A; (iii) there is an atom S(x, y) ∈ q with x a root
variable and such that y 6= x occurs only in this atom, and T |= ∃R v ∃S. The
concrete version of our dichotomy result is as follows. Its proof is based on a
careful analysis of FO-rewritings and the results in [10].
Theorem 5 (Concrete Dichotomy). Let T be a DL-Lite TBox. A T -minimal
CQ q is in PTime relative to T iff
1. q is a simple tree query, and
2. if r and r− are T -generated in q, then {r(x, y)} vT q or q is of the form
   {S1 (x, y), . . . , Sk (x, y)} for roles S1 , . . . , Sk .
Otherwise, q is #P-hard relative to T .
As examples, consider again the queries q1 , q2 , and q3 in Figure 1 and let T∅ be
the empty TBox. All CQs are T∅ -minimal, q1 and q2 are in PTime, and q3 is
#P-hard (all relative to T∅ ). Now consider the TBox T = {∃s v ∃r}. Then q1
is T -minimal and still in PTime; q2 is T -minimal, and is now #P-hard because
both s and s− is T -generated. The CQ q3 can be made T -minimal by dropping
the r-atom, and is in PTime relative to T .
Approximations. Theorems 4 and 5 show that only very simple pairs (q, T ) can
be answered in PTime. Hence, it is reasonable to consider the approximation
of answer probabilities. Of particular relevance are fully-polynomial randomized
approximation schemes (FPRASes), see [22,34] for a definition and more details.
In [9] it is observed that there is an FPRAS for every UCQ over a probabilistic
database. As a consequence of Theorem 1, there is thus also an FPRAS for every
pair (q, T ) such that q is FO-rewritable relative to T ; in particular, there are
FPRASes for every CQ and DL-Lite ontology even if additional features such
                                Ontology-Based Access to Probabilistic Data        11

as role inclusions are admitted. This observation clearly gives hope for practical
feasibility of probabilistic OBDA. In the full paper, we also consider the existence
of FPRASes for more expressive ontology languages [19].


5    Beyond Query Rewriting
A CQ q is FO-rewritable relative to a TBox T if there is an FO-rewriting of q
relative to T . The aim of this section is to establish that, in a sense, FO-rewriting
is a complete tool for proving PTime results for CQ answering in probabilistic
OBDA: we show that whenever a CQ q is not FO-rewritable relative to a TBox
T , then q is #P-hard relative to T ; thus, when a query is in PTime relative to a
TBox T , then this can always be shown via FO-rewritability. To achieve this goal,
we select ELI as the TBox language because it properly generalizes DL-Lite (as
in the previous sections we disregard ⊥) and, unlike DL-Lite, also embraces non
FO-rewritable CQs/TBoxes. Note that, in traditional OBDA, there is a drastic
difference in data complexity of CQ-answering between DL-Lite and ELI: the
former is in AC0 while the latter is PTime-complete.
    We focus on Boolean CQs q that are rooted, i.e., q involves at least one
individual name and is connected. This is a natural case since, for any non-
Boolean connected CQ q(x) and potential answer a, the probability pA,T (a ∈
q(x)) that a is a certain answer to q w.r.t. A and T is identical to the probability
p(A, T |= q[a]) that A and T entail the rooted Boolean CQ q[a]. The following
theorem says that there is no hope for PTime algorithms in the case a query q
is not FO-rewritable relative to T .
Theorem 6. If a Boolean rooted CQ q is not FO-rewritable relative to an ELI-
TBox T , then q is #P-hard relative to T .
As a by-product of Theorem 6 we obtain the following dichotomy.
Theorem 7 (ELI dichotomy). Let q be a rooted Boolean CQ and T an ELI-
TBox. Then q is in PTime relative to T or #P-hard relative to T .


6    Conclusion
We have introduced a framework for ontology-based access to probabilistic data
that can be implemented using existing probabilistic database system, and we
have analyzed the data complexity of computing answer probabilities in this
framework. There are various opportunities for future work. For example, it
would be interesting to extend the concrete dichotomy: on the one hand, it can
be extended to CQs that involve constants and are not necessarily connected;
on the other hand, one could study more expressive versions of DL-Lite that, for
example, allow role hierarchy statements in the TBox. It would also be worth-
while to add means to express uncertainty to the TBox formalism instead of
admitting it only in the ABox; this is done for example in [28,12], but it remains
to be seen whether the semantics used there is appropriate for our purposes.
12                                  Jean Christoph Jung and Carsten Lutz

References

 1. Antova, L., Jansen, T., Koch, C., Olteanu, D.: Fast and simple relational processing
    of uncertain data. In: Proc. of ICDE. 983–992 (2008)
                                             6
 2. Antova, L., Koch, C., Olteanu, D.: 1010 worlds and beyond: efficient representation
    and processing of incomplete information. VLDB J. 18(5), 1021–1040 (2009)
 3. Baader, F., Calvanese, D., McGuinness, D.L., Nardi, D., Patel-Schneider, P.F.
    (eds.): The Description Logic Handbook. Cambridge University Press (2003)
 4. Bienvenu, M., Lutz, C., Wolter, F.: Query containment in description logics recon-
    sidered. In: Proc. of KR (2012)
 5. Boulos, J., Dalvi, N.N., Mandhani, B., Mathur, S., Ré, C., Suciu, D.: MYSTIQ:
    a system for finding more answers by using probabilities. In: Proc. of SIGMOD.
    891–893 (2005)
 6. Calvanese, D., Giacomo, G.D., Lembo, D., Lenzerini, M., Rosati, R.: Tractable
    reasoning and efficient query answering in description logics: The DL-Lite family.
    J. Autom. Reasoning 39(3), 385–429 (2007)
 7. Dalvi, N.N., Ré, C., Suciu, D.: Probabilistic databases: diamonds in the dirt. Com-
    mun. ACM 52(7), 86–94 (2009)
 8. Dalvi, N.N., Schnaitter, K., Suciu, D.: Computing query probability with incidence
    algebras. In: Proc. of PODS. 203–214. ACM (2010)
 9. Dalvi, N.N., Suciu, D.: Efficient query evaluation on probabilistic databases. VLDB
    J. 16(4), 523–544 (2007)
10. Dalvi, N.N., Suciu, D.: The dichotomy of probabilistic inference for unions of con-
    junctive queries. J. ACM 59(6), 30 (2012)
11. Finger, M., Wassermann, R., Cozman, F.G.: Satisfiability in EL with sets of prob-
    abilistic ABoxes. Proc. of DL. CEUR-WS, Vol. 745 (2011)
12. Fuhr, N., Rölleke, T.: A probabilistic relational algebra for the integration of infor-
    mation retrieval and database systems. ACM Trans. Inf. Syst. 15(1), 32–66 (1997)
13. Furche, T., Gottlob, G., Grasso, G., Gunes, O., Guo, X., Kravchenko, A., Orsi,
    G., Schallhart, C., Sellers, A.J., Wang, C.: Diadem: domain-centric, intelligent,
    automated data extraction methodology. In Proc. of WWW. 267–270. ACM (2012)
14. Gottlob, G., Lukasiewicz, T., Simari, G.I.: CQ answering in probabilistic
    datalog+/− ontologies. In Proc. of RR. Vol. 6902 of LNCS, 77–92. Springer (2011)
15. Green, T.J., Tannen, V.: Models for incomplete and probabilistic information.
    IEEE Data Engineering Bulletin 29(1), 17–24 (2006)
16. Gupta, R., Sarawagi, S.: Creating probabilistic databases from information extrac-
    tion models. In Proc. of VLDB. 965–976. ACM (2006)
17. Halpern, J.Y.: An analysis of first-order logics of probability. Artif. Intell. 46(3),
    311–350 (1990)
18. Imielinski, T., Jr., W.L.: Incomplete information in relational databases. J. of the
    ACM 31(4), 761–791 (1984)
19. Jung, J.C., Lutz, C.: Ontology-based access to probabilistic data with OWL QL.
    In: Proc. of ISWC. pp. 182–197. Springer (2012)
20. Jerrum, M., Valiant, L.G., Vazirani, V.V.: Random generation of combinatorial
    structures from a uniform distribution. Theor. Comput. Sci. 43, 169–188 (1986)
21. Karger, D.R.: A randomized fully polynomial time approximation scheme for the
    all-terminal network reliability problem. SIAM J. Comput. 29(2), 492–514 (1999)
22. Karp, R.M., Luby, M.: Monte-carlo algorithms for enumeration and reliability prob-
    lems. In Proc. of FoCS. 56–64. IEEE Computer Society (1983)
                                 Ontology-Based Access to Probabilistic Data          13

23. Kontchakov, R., Lutz, C., Toman, D., Wolter, F., Zakharyaschev, M.: The com-
    bined approach to query answering in DL-Lite. In Proc. of KR. AAAI Press (2010)
24. Laender, A.H.F., Ribeiro-Neto, B.A., da Silva, A.S., Teixeira, J.S.: A brief survey
    of web data extraction tools. SIGMOD Record 31(2), 84–93 (2002)
25. Lukasiewicz, T., Straccia, U.: Managing uncertainty and vagueness in description
    logics for the semantic web. J. Web Sem. 6(4), 291–308 (2008)
26. Lutz, C., Schröder, L.: Probabilistic description logics for subjective uncertainty.
    In Proc. of KR. AAAI Press (2010)
27. Lutz, C., Wolter, F.: Non-uniform data complexity of query answering in descrip-
    tion logics. In: Proc. of KR. AAAI Press (2012)
28. Raedt, L.D., Kimmig, A., Toivonen, H.: Problog: a probabilistic prolog and its
    application in link discovery. In Proc. of IJCAI. 2468–2473. AAAI Press (2007)
29. Rossman, B.: Homomorphism preservation theorems. J. ACM. 55(3). 1–54 (2008).
30. Sarma, A.D., Benjelloun, O., Halevy, A.Y., Widom, J.: Working models for uncer-
    tain data. In: Proc. of ICDE. IEEE Computer Society (2006)
31. Straccia, U.: Top-k retrieval for ontology mediated access to relational databases.
    In: Information Sciences. 108, 1–23 (2012).
32. Suciu, D., Olteanu, D., Ré, C., Koch, C.: Probabilistic Databases. Synthesis Lec-
    tures on Data Management, Morgan & Claypool Publishers (2011)
33. Valiant, L.G.: The complexity of enumeration and reliability problems. SIAM J.
    Comput. 8(3), 410–421 (1979)
34. Vazirani, V.V.: Approximation algorithms. Springer (2001)
35. Widom, J.: Trio: A system for integrated management of data, accuracy, and lin-
    eage. In Proc. of CIDR. 262–276 (2005)
36. Zenklusen, R., Laumanns, M.: High-confidence estimation of small s-t reliabilities
    in directed acyclic networks. Networks 57(4), 376–388 (2011)