Ontology-Based Access to Probabilistic Data Jean Christoph Jung and Carsten Lutz Universität Bremen, Germany {jeanjung,clu}@informatik.uni-bremen.de Abstract. We propose a framework for querying probabilistic data in the presence of an ontology, arguing that the interplay of probabilities and ontologies is fruitful in applications such as managing data that was extracted from the web. The prime inference problem is computing answer probabilities, and we show that it can be implemented using standard probabilistic database systems, similar to traditional ontology- based data access. We demonstrate that query rewriting into first-order logic is an important tool for our framework. First, it is used to establish a PTime vs. #P dichotomy for the data complexity of this problem by lifting a corresponding result from probabilistic databases. Then, we use it to characterize which pairs of query and TBox are in PTime. Finally, it is shown that non-existence of such a rewriting implies #P-hardness. 1 Introduction In recent years, ontology based data access (OBDA) has become an active area of description logic research. In OBDA, an ontology provides a semantics for incom- plete data with the aim of facilitating the computation of more complete answers to queries. There are applications, though, in which it is necessary to query data that is not only incomplete, but also uncertain. For instance, data extracted from web sources [24] such as an estate agents’ web page is typically incomplete. It is also uncertain because web sources tend to be unreliable and extraction tools are based on heuristic decisions and thus significantly error prone. In this paper, we propose an extension of OBDA that captures uncertain data through a probabilistic data model and replaces the computation of certain answers with computing the probabilities of certain answers. In brief, our approach relates to probabilistic database systems (PDBMSs) in the same way that traditional OBDA relates to RDBMSs. Framework. We consider probabilistic ABox formalisms that are inspired by data models from the currently very active area of probabilistic databases [7,32]. Specifically, pABoxes enrich classical ABoxes with probabilities that are attached to probabilistic events, and with event expressions that are attached to ABox assertions. For example, a pABox assertion SoccerPlayer(messi) can be associ- ated with an event expression e1 ∨ e2 , where e1 and e2 represent events such as ‘web extraction tool x correctly analyzed webpage y stating that Messi is a soccer player’. Event e1 can then be associated with probability .7, and e2 with .9. Events are assumed to be probabilistically independent, which results 2 Jean Christoph Jung and Carsten Lutz in a straightforward semantics that is similar to well-known probabilistic versions of datalog [28,12]. Ontologies are represented by description logic TBoxes. In this setting, which we call ontology based access to probabilistic data (pOBDA), we are interested in computing the probabilities of certain answers to conjunctive queries (CQs). Note that uncertainty occurs only in the data, but neither in the ontology nor in the query. We believe that pOBDA is of general interest and potentially useful for a wide range of applications including the management of data extracted from the web, machine translation, and dealing with data that arises from sensor networks. All these applications can potentially benefit from a fruitful interplay between ontologies and probabilities; in particular, the ontology can help to reduce the uncertainty of the data. Contributions. The main aim of this paper is to study the data complexity of pOBDA. More precisely, we pursue a non-uniform approach as recently initiated in [27], which aims at fully classifying the data complexity of every pair (q, T ) that consists of a CQ q and a TBox formulated in a fixed ‘master DL’. As a central tool of our analysis, we use query rewriting into first-order (FO) queries, which is an important technique for traditional OBDA [6]. We start with showing that FO-rewritings from traditional OBDA are useful also in the context of pOBDA: for any pABox A, the probability that a tuple a is a certain answer to q over a pABox A relative to T is identical to the probability that a is an answer to the FO-rewriting qT of q and T over A viewed as a probabilistic database. This lifting of FO-rewritings to the probabilistic case immediately implies that one can implement pOBDA based on existing PDBMSs such as MayBMS, Trio, and MystiQ [1,35,5]. Lifting also allows us to carry over the dichotomy between PTime and #P- hardness for computing the probabilities of answers to unions of conjunctive queries (UCQs) over probabilistic databases recently obtained by Dalvi, Suciu, and Schnaitter [8] to our pOBDA framework, provided that we restrict ourselves to TBoxes formulated in (the core version of) DL-Lite and to ipABoxes, which are a special case of pABoxes in which all ABox assertions are probabilistically independent. Based on a careful syntactic analysis, we provide a concrete char- acterization of those CQs q and DL-Lite TBoxes T for which computing answer probabilities is in PTime. We then proceed to showing that query rewriting is a complete tool for proving PTime data complexity in pOBDA, in the fol- lowing sense: we replace DL-Lite with the strictly more expressive description logic ELI where, in contrast to DL-Lite, rewritings into first-order queries do not exist for every CQ q and TBox T ; we then prove that if some (q, T ) does not have a rewriting, then computing answer probabilities for q relative to T is #P-hard. Thus, if it is possible at all to prove that some (q, T ) has PTime data complexity, then this can always be done using query rewriting. Both in DL-Lite and ELI, the class of queries and TBoxes with PTime data complexity is relatively small. This negative result is relativized by the fact that answer probabilities can often be efficiently approximated. In particular, all pairs (q, T ) admit approximation in terms of a fully polynomial randomized approximation scheme (FPRAS) whenever q is FO-rewritable relative to T . We provide a brief Ontology-Based Access to Probabilistic Data 3 discussion of such approximations and refer to the full version [19] of this paper for more information. Related Work. The probabilistic ABox formalism studied in this paper is inspired by the probabilistic database models in [9], but can also be viewed as a variation of probabilistic versions of datalog and Prolog, see [28,12] and references therein. There have recently been other approaches to combining ontologies and uncer- tainty for data access [11,14], with a different semantics; the setup considered by Gottlob, Lukasiewicz, and Simari in [14] is close in spirit to the framework studied here, but also allows probabilities in the TBox and has a different, rather intricate semantics based on Markov logic. In fact, we deliberately avoid proba- bilities in the ontology because (i) this results in a simple and fundamental, yet useful formalism that still admits a very transparent semantics and (ii) it enables the use of standard PDBMSs for query answering in the presence of ontologies. Note that the analogous property of being implementable using state-of-the-art RDBMSs is a favorable feature of traditional OBDA [6]. There has also been a large number of proposals for enriching description logic TBoxes (instead of ABoxes) with probabilities, see [25,26] and the references therein. Our running application example is web data extraction, in the spirit of [16] to store extracted web data in a probabilistic database. Note that it has also been proposed to inte- grate both probabilities and ontologies directly into the data extraction tool [13]. We believe that both approaches can be useful and could even be orchestrated to play together. This paper is a condensed version of [19]. For the missing proofs and some additional material we refer the reader to the long version of [19]. 2 Preliminaries We use standard notation for the syntax and semantics of description logics (DLs) and refer to [3] for full details. As usual, C, D denote (potentially) com- posite concepts, A, B concept names, r, s role names, R and S role names or their inverse, and a, b individual names. When R = r− , then R− denotes r. We consider the ontology languages DL-Lite and ELI. Regarding the former, we concentrate on the dialect DL-Litecore , where TBoxes are finite sets of concept inclusions (CIs) B v B 0 and B u B 0 v ⊥ with B and B 0 concepts of the form ∃r, ∃r− , > or A. In ELI, a TBox is a finite set of CIs C v D where C and D are (potentially) compound concepts of the form >, A, C 0 u D0 , ∃r.C 0 , and ∃r− .C 0 . As usual, an ABox is a finite set of concept assertions A(a) and role asser- tions r(a, b). We use Ind(A) to denote the set of individual names used in the ABox A and write r− (a, b) ∈ A for r(b, a) ∈ A. The semantics of DLs is based on interpretations I = (∆I , ·I ); we adopt the unique name assumption (UNA), i.e., we require aI 6= bI for all individuals a 6= b. Conjunctive queries (CQs) take the form ∃y.ϕ(x, y), with ϕ a conjunction of atoms of the form A(t) and r(t, t0 ) and where x, y denote (tuples of) variables and t, t0 denote terms, i.e., a variable or an individual name. We call the variables 4 Jean Christoph Jung and Carsten Lutz in x the answer variables and those in y the quantified variables. The set of all variables in a CQ q is denoted by var(q) and the set of all terms in q by term(q). A CQ q is n-ary if it has n answer variables and Boolean if it is 0-ary. Whenever convenient, we treat a CQ as a set of atoms and sometimes write r− (t, t0 ) ∈ q instead of r(t0 , t) ∈ q. Let I be an interpretation and q a CQ with answer variables x1 , . . . , xk . For individual names a = a1 · · · ak , an a-match for q in I is a mapping π : term(q) → ∆I such that π(xi ) = ai for 1 ≤ i ≤ k, π(a) = aI for all individual names a ∈ term(q), π(t) ∈ AI for all A(t) ∈ q, and (π(t1 ), π(t2 )) ∈ rI for all r(t1 , t2 ) ∈ q. We write I |= q[a] if there is an a-match of q in I and let ans(q, I) denote the set of all a with I |= q[a]. For a TBox T and an ABox A, we write T , A |= q[a] if I |= q[a] for all models I of T and A. The set certT (q, A) of all certain answers consists of all tuples a over Ind(A) with T , A |= q[a]. 3 Probabilistic OBDA We introduce our framework for probabilistic OBDA, starting with the definition of a rather general, probabilistic version of ABoxes. Let E be a countably infinite set of atomic (probabilistic) events. An event expression is built up from atomic events using the Boolean operators ¬, ∧, ∨. We use expr(E) to denote the set of all event expressions over E. A probability assignment for E ⊆ E is a map E → [0, 1]. Definition 1 (pABox). A probabilistic ABox (pABox) is of the form (A, e, p) with A an ABox, e a map A → expr(E), and p a probability assignment for EA , the atomic events in A. Example 1. We consider as a running example an information extraction tool that is gathering data from the web, see [16] for a similar setup. Assume we are gathering data about soccer players and the clubs they play for in the current 2012 season, and we want to represent the result as a pABox. (1) The tool processes a newspaper article stating that ‘Messi is the soul of the Argentinian national soccer team’. Because the exact meaning of this phrase is unclear (it could refer to a soccer player, a coach, a mascot), it generates the assertion Player(messi) associated with the atomic event expression e1 with p(e1 ) = 0.7. The event e1 represents that the phrase was interpreted correctly. (2) The tool finds the Wikipedia page on Lionel Messi, which states that he is a soccer player. Since Wikipedia is typically reliable and up to date, but not always correct, it updates the expression associated with Player(messi) to e1 ∨ e2 and associates e2 with p(e2 ) = 0.95. (3) The tool finds an HTML table on the homepage of FC Barcelona saying that the top scorers of the season are Messi, Villa, and Pedro. It is not stated whether the table refers to the 2011 or the 2012 season, and consequently we generate the ABox assertions playsfor(x, FCbarca) for x ∈ {messi, villa, pedro} all associated with the same atomic event expression e3 with p(e3 ) = 0.5. Intuitively, the event e3 is that the table refers to 2012. Ontology-Based Access to Probabilistic Data 5 (4) Still processing the table, the tool applies the background knowledge that top scorers are typically strikers. It generates three assertions Striker(x) with x ∈ {messi, villa, pedro}, associated with atomic events e4 , e04 , and e004 . It sets p(e4 ) = p(e04 ) = p(e004 ) = 0.8. The probability is higher than in (3) since being a striker is a more stable property than playing for a certain club, thus this information does not depend so much on whether the table is from 2011 or 2012. (5) The tool processes the twitter message ‘Villa was the only one to score a goal in the match between Barca and Real’. It infers that Villa plays either for Barcelona or for Madrid, generating the assertions playsfor(villa, FCbarca) and playsfor(villa, realmadrid). The first assertion is associated with the event e5 , the second one with ¬e5 . It sets p(e5 ) = 0.5. We now definte the semantics of OBDA over pABoxes. Each E ⊆ EA can be viewed as a truth assignment that makes all events in E true and all events in EA \ E false. Definition 2. Let (A, e, p) be a pABox. For each E ⊆ EA , define a corre- sponding non-probabilistic ABox AE := {α ∈ A | E |= e(α)}. The function p represents a probability distribution on 2EA , by setting for each E ⊆ EA : Y Y p(E) = p(e) · (1 − p(e)). e∈E e∈EA \E The probability of an answer a ∈ Ind(A)n to an n-ary conjunctive query q over a pABox A and TBox T is X pA,T (a ∈ q) = p(E). E⊆EA : a∈certT (q,AE ) For Boolean CQs q, we write p(A, T |= q) instead of pA,T (() ∈ q), where () denotes the empty tuple. Example 2. Consider again the web data extraction example discussed above. To illustrate how ontologies can help to reduce uncertainty, we use the DL-Lite TBox T = { ∃playsfor v Player Player v ∃playsfor ∃playsfor− v SoccerClub Striker v Player } Consider the following subcases considered above. (1) + (3) The resulting pABox comprises the following assertions with associated event expressions: Player(messi) e1 playsfor(messi, FCbarca) e3 playsfor(villa, FCbarca) e3 playsfor(pedro, FCbarca) e3 with p(e1 ) = 0.7 and p(e3 ) = 0.5. Because of the statement ∃playsfor v Player, using T (instead of an empty TBox) increases the probability of messi to be an answer to the query Player(x) from 0.7 to 0.85. 6 Jean Christoph Jung and Carsten Lutz (5) The resulting pABox is playsfor(villa, FCbarca) e5 playsfor(villa, realmadrid) ¬e5 with p(e5 ) = 0.5. Although Player(villa) does not occur in the data, the proba- bility of villa to be an answer to the query Player(x) is 1 (again by the TBox- statement ∃playsfor v Player). (3)+(4) This results in the pABox playsfor(messi, FCbarca) e3 Striker(messi) e4 playsfor(villa, FCbarca) e3 Striker(villa) e04 playsfor(pedro, FCbarca) e3 Striker(pedro) e004 with p(e3 ) = 0.5 and p(e4 ) = p(e04 ) = p(e004 ) = 0.8. Due to the last three CIs in T , each of messi, villa, pedro is an answer to the CQ ∃y.playsfor(x, y)∧SoccerClub(y) with probability 0.9. Related Models in Probabilistic Databases. Nowadays, there is an abundance of probabilistic data models that provide compact representation of distributions over potentially large sets of possible worlds, see [15,30,2] and the references therein. Our pABoxes can be viewed as an open world version of the probabilistic data model studied by Dalvi and Suciu in [9]. It is as a less succinct version of c- tables, a traditional data model for probabilistic databases due to Imielinski and Lipski [18]. Since we are working with an open world semantics, pABoxes instead represent a distribution over possible world descriptions. Each such description may have any number of models. Note that our semantics is similar to the semantics of (“type 2”) probabilistic first-order and description logics [17,26]. Dealing with Inconsistencies. Of course, some of the ABoxes AE might be incon- sistent w.r.t. the TBox T used. In this case, it may be undesirable to let them contribute to the probabilities of answers. For example, if we use the pABox Striker(messi) e1 Goalie(messi) e2 with p(e1 ) = 0.8 and p(e2 ) = 0.3 and the TBox Goalie u Striker v ⊥, then messi is an answer to the query SoccerClub(x) with probability 0.24 while one would probably expect it to be zero (which is the result when the empty TBox is used). We follow Antova, Koch, and Olteanu and advocate a pragmatic solution based on rescaling [2]. More specifically, we remove those ABoxes AE that are inconsistent w.r.t. T and rescale the remaining set of ABoxes so that they sum up to probability one. In other words, we set pA,T (a ∈ q) − p(A, T |= ⊥) pbA,T (a ∈ q) = 1 − p(A, T |= ⊥) where ⊥ is a Boolean query that is entailed exactly by those ABoxes A that are inconsistent w.r.t. T . The rescaled probability pbA,T (a ∈ q) can be computed in PTime when this is the case both for pA,T (a ∈ q) and p(A, T |= ⊥). Note that Ontology-Based Access to Probabilistic Data 7 rescaling results in some effects that might be unexpected such as reducing the probability of messi to be an answer to Striker(x) from 0.8 to ≈0.74 when the above TBox is added. In the remainder of the paper, for simplicity we will only admit TBoxes T such that all ABoxes A are consistent w.r.t. T . 4 Query Rewriting The aim of this section is to show that FO-rewriting, a prominent approach to traditional OBDA, are fruitful also in the case of computing probabilities of certain answers in probabilistic OBDA. In particular, we use it to lift the PTime vs. #P dichotomy result on probabilistic databases recently obtained by Dalvi, Suciu, and Schnaitter [8] to probabilistic OBDA in DL-Lite. 4.1 Lifting FO-Rewritings to probabilistic OBDA A first-order query qT is an FO-rewriting of a CQ q and an FO-TBox T (i.e., a first-order theory) if certT (q, A) = ans(qT , IA ) for every ABox A, where IA to denotes the ABox A viewed as an interpretation. The query rewriting approach to traditional OBDA consists of computing, given a CQ q and a TBox T , an FO-rewriting qT and then handing it over for execution to a relational database system that stores the ABox A. The following observation states that FO-rewritings from traditional OBDA are also useful in probabilistic OBDA. We use pdA (a ∈ q) to denote the prob- ability that a is an answer to the query q given the pABox A viewed as a probabilistic database in the sense of Dalvi and Suciu [8]. More specifically, X pdA (a ∈ q) = p(E) E⊆EA | a∈ans(q,IAE ) The following is immediate from the definitions. Theorem 1 (Lifting). Let T be an FO-TBox, A a pABox, q an n-ary CQ, a ∈ Ind(A)n a candidate answer for q, and qT an FO-rewriting of q relative to T . Then pA,T (a ∈ q) = pdA (a ∈ qT ). From an application perspective, Theorem 1 enables the use of probabilistic database systems such as MayBMS, Trio, and MystiQ for implementing proba- bilistic OBDA [1,35,5]. Note that it might be necessary to adapt pABoxes in an appropriate way in order to match the data models of these systems. However, such modifications do not impair applicability of Theorem 1. From a theoretical viewpoint, Theorem 1 establishes query rewriting as a useful tool for analyzing data complexity in probabilistic OBDA. We say that a CQ q is in PTime relative to a TBox T if there is a polytime algorithm that, given an ABox A and a candidate answer a ∈ Ind(A)n to q, computes pA,T (a ∈ q). We say that q is #P-hard relative to T if the afore mentioned 8 Jean Christoph Jung and Carsten Lutz problem is hard for the counting complexity class #P [33]. We pursue a non- uniform approach to the complexity of query answering in probabilistic OBDA, as recently initiated in [27]: ideally, we would like to understand the precise complexity of every CQ q relative to every TBox T , against the background of some preferably expressive ‘master logic’ used for T . Unsurprisingly, pABoxes are too strong a formalism to admit any tractable queries worth mentioning. An n-ary CQ q is trivial for a TBox T iff for every ABox A, we have certT (A, q) = Ind(A)n . Theorem 2. Over pABoxes, every CQ q is #P -hard relative to every first-order TBox T for which it is nontrivial. Theorem 2 motivates the study of more lightweight probabilistic ABox for- malisms. While pABoxes (roughly) correspond to c-tables, which are among the most expressive probabilistic data models, we now move to the other end of the spectrum and introduce ipABoxes as a counterpart of tuple independent databases [9,12]. Argueably, the latter are the most inexpressive probabilistic data model that is still useful. Definition 3 (ipABox). An assertion-independent pABox (or ipABox) is a probabilistic ABox in which all event expressions are atomic and where each atomic event expression is associated with at most one ABox assertion. To save notation, we write ipABoxes in the form (A, p) where A is an ABox and p is a map A → [0, 1] that assigns a probability to each ABox assertion. In this representation, the events are implicit (one atomic event per ABox assertion) and we write p(α) to denote the probability of the event associated with the assertion α. Analogously to pABoxes, all events (and thus all assertions) are independent. Note that the answer probability of a ∈ Ind(A)n to an n-ary conjunctive query q over an ipABox (A, p) relative to a TBox T simplifies to X pA,T (a ∈ q) = p(A0 ) A0 ⊆A : a∈certT (q,A0 ) 0 Q Q where p(A ) = α∈A0 p(α) · α∈A\A0 (1 − p(α)). Reconsidering our web data extraction example, it turns out that cases (1) and (4) yield ipABoxes, whereas cases (2), (3), and (5) do not. We refer to [32] for a discussion of the usefulness of ipABoxes/tuple independent databases. For the remainder of the paper, we assume that only ipABoxes are admitted unless explicitly noted otherwise. 4.2 Lifting the PTime vs. #P Dichotomy We now use Theorem 1 to lift a PTime vs. #P dichotomy recently obtained in the area of probabilistic databases to probabilistic OBDA in DL-Lite. Note that, for any CQ and DL-Lite TBox, an FO-rewriting is guaranteed to exist [6]. The central observation is that, by Theorem 1, computing the probability of answers to a CQ q relative to a TBox T over ipABoxes is exactly the same problem as computing the probability of answers to qT over (ipABoxes viewed as) tuple Ontology-Based Access to Probabilistic Data 9 r s s s A t A s t0 r r t q1 q2 q3 Fig. 1. Example queries independent databases. We can thus analyze the complexity of CQs/TBoxes over ipABoxes by analyzing the complexity of their rewritings. In particular, standard rewriting techniques produce for each CQ and DL-Lite TBox an FO- rewriting that is a union of conjunctive queries (a UCQ) and thus, together with Theorem 1, Dalvi, Suciu and Schnaitter’s PTime vs. #P dichotomy for UCQs over tuple independent databases [8] immediately yields the following. Theorem 3 (Abstract Dichotomy). Let q be a CQ and T a DL-Lite TBox. Then q is in PTime relative to T or q is #P-hard relative to T . Note that Theorem 3 actually holds for every DL that enjoys FO-rewritability. Although interesting from a theoretical perspective, Theorem 3 is not fully satis- factory as it does not tell us which CQs are in PTime relative to which TBoxes. In the remainder of this section, we carry out a careful inspection of the FO- rewritings obtained in our framework and of the dichotomy result obtained by Dalvi, Suciu and Schnaitter, which results in a more concrete formulation of the dichotomy stated in Theorem 3 and provides a transparent characterization of the PTime cases. For simplicity, we concentrate on CQs that are connected, Boolean, and do not contain individual names. For two CQs q, q 0 and a TBox T , we say that q T -implies q 0 and write q vT q 0 when certT (q, A) ⊆ certT (q 0 , A) for all ABoxes A; we say that q and q 0 are T - equivalent and write q ≡T q 0 if q vT q 0 and q 0 vT q; we say that q is T -minimal if there is no q 0 ( q such that q ≡T q 0 . When T is empty, we simply drop it from the introduced notation, writing for example q v q 0 and speaking of minimality. To have more control over the effect of the TBox, we will generally work with CQs q and TBoxes T such that q is T -minimal. This is without loss of generality because for every CQ q and TBox T , we can find a CQ q 0 that is T -minimal and such that q ≡T q 0 [4]; note that the answer probabilities relative to T are identical for q and q 0 . We now introduce a class of queries that will play a crucial role in our analysis. Definition 4 (Simple Tree Queries). A CQ q is a simple tree if there is a variable xr ∈ var(q) that occurs in every atom in q, i.e., all atoms in q are of the form A(xr ), r(xr , y), or r(y, xr ) (y = xr is possible). Such a variable xr is called a root variable. As examples, consider the CQs in Figure 1, which are all simple tree queries. The following result shows why simple tree queries are important. A UCQ qb is reduced if for all disjuncts q, q 0 of qb, q v q 0 implies q = q 0 . 10 Jean Christoph Jung and Carsten Lutz Theorem 4. Let q be a CQ and T a DL-Lite TBox such that q is T -minimal and not a simple tree query. Then q is #P-hard relative to T . Proof. (sketch) Let qT be a UCQ that is an FO-rewriting of q relative to T . By definition of FO-rewritings, we can w.l.o.g. assume that q occurs as a disjunct of qT . The following is shown in [8]: 1. if a minimal CQ does not contain a variable that occurs in all atoms, then it is #P-hard over tuple independent databases; 2. if a reduced UCQ qb contains a CQ that is #P-hard over tuple independent databases, then qb is also hard over tuple independent databases. Note that since q is T -minimal, it is also minimal. By Points 1 and 2 above, it thus suffices to show that qT can be converted into an equivalent reduced UCQ such that q is still a disjunct, which amounts to proving that there is no disjunct q 0 in qT such that q v q 0 and q 0 6v q. The details of the proof, which is surprisingly subtle, are given in the appendix. o To finish anlyzing the dichotomy, it thus remains to analyze simple tree queries. We say that a role R is T -generated in a CQ q if one of the following holds: (i) there is an atom R(xr , y) ∈ q and y 6= xr ; (ii) there is an atom A(xr ) ∈ q and T |= ∃R v A; (iii) there is an atom S(x, y) ∈ q with x a root variable and such that y 6= x occurs only in this atom, and T |= ∃R v ∃S. The concrete version of our dichotomy result is as follows. Its proof is based on a careful analysis of FO-rewritings and the results in [10]. Theorem 5 (Concrete Dichotomy). Let T be a DL-Lite TBox. A T -minimal CQ q is in PTime relative to T iff 1. q is a simple tree query, and 2. if r and r− are T -generated in q, then {r(x, y)} vT q or q is of the form {S1 (x, y), . . . , Sk (x, y)} for roles S1 , . . . , Sk . Otherwise, q is #P-hard relative to T . As examples, consider again the queries q1 , q2 , and q3 in Figure 1 and let T∅ be the empty TBox. All CQs are T∅ -minimal, q1 and q2 are in PTime, and q3 is #P-hard (all relative to T∅ ). Now consider the TBox T = {∃s v ∃r}. Then q1 is T -minimal and still in PTime; q2 is T -minimal, and is now #P-hard because both s and s− is T -generated. The CQ q3 can be made T -minimal by dropping the r-atom, and is in PTime relative to T . Approximations. Theorems 4 and 5 show that only very simple pairs (q, T ) can be answered in PTime. Hence, it is reasonable to consider the approximation of answer probabilities. Of particular relevance are fully-polynomial randomized approximation schemes (FPRASes), see [22,34] for a definition and more details. In [9] it is observed that there is an FPRAS for every UCQ over a probabilistic database. As a consequence of Theorem 1, there is thus also an FPRAS for every pair (q, T ) such that q is FO-rewritable relative to T ; in particular, there are FPRASes for every CQ and DL-Lite ontology even if additional features such Ontology-Based Access to Probabilistic Data 11 as role inclusions are admitted. This observation clearly gives hope for practical feasibility of probabilistic OBDA. In the full paper, we also consider the existence of FPRASes for more expressive ontology languages [19]. 5 Beyond Query Rewriting A CQ q is FO-rewritable relative to a TBox T if there is an FO-rewriting of q relative to T . The aim of this section is to establish that, in a sense, FO-rewriting is a complete tool for proving PTime results for CQ answering in probabilistic OBDA: we show that whenever a CQ q is not FO-rewritable relative to a TBox T , then q is #P-hard relative to T ; thus, when a query is in PTime relative to a TBox T , then this can always be shown via FO-rewritability. To achieve this goal, we select ELI as the TBox language because it properly generalizes DL-Lite (as in the previous sections we disregard ⊥) and, unlike DL-Lite, also embraces non FO-rewritable CQs/TBoxes. Note that, in traditional OBDA, there is a drastic difference in data complexity of CQ-answering between DL-Lite and ELI: the former is in AC0 while the latter is PTime-complete. We focus on Boolean CQs q that are rooted, i.e., q involves at least one individual name and is connected. This is a natural case since, for any non- Boolean connected CQ q(x) and potential answer a, the probability pA,T (a ∈ q(x)) that a is a certain answer to q w.r.t. A and T is identical to the probability p(A, T |= q[a]) that A and T entail the rooted Boolean CQ q[a]. The following theorem says that there is no hope for PTime algorithms in the case a query q is not FO-rewritable relative to T . Theorem 6. If a Boolean rooted CQ q is not FO-rewritable relative to an ELI- TBox T , then q is #P-hard relative to T . As a by-product of Theorem 6 we obtain the following dichotomy. Theorem 7 (ELI dichotomy). Let q be a rooted Boolean CQ and T an ELI- TBox. Then q is in PTime relative to T or #P-hard relative to T . 6 Conclusion We have introduced a framework for ontology-based access to probabilistic data that can be implemented using existing probabilistic database system, and we have analyzed the data complexity of computing answer probabilities in this framework. There are various opportunities for future work. For example, it would be interesting to extend the concrete dichotomy: on the one hand, it can be extended to CQs that involve constants and are not necessarily connected; on the other hand, one could study more expressive versions of DL-Lite that, for example, allow role hierarchy statements in the TBox. It would also be worth- while to add means to express uncertainty to the TBox formalism instead of admitting it only in the ABox; this is done for example in [28,12], but it remains to be seen whether the semantics used there is appropriate for our purposes. 12 Jean Christoph Jung and Carsten Lutz References 1. Antova, L., Jansen, T., Koch, C., Olteanu, D.: Fast and simple relational processing of uncertain data. In: Proc. of ICDE. 983–992 (2008) 6 2. Antova, L., Koch, C., Olteanu, D.: 1010 worlds and beyond: efficient representation and processing of incomplete information. VLDB J. 18(5), 1021–1040 (2009) 3. Baader, F., Calvanese, D., McGuinness, D.L., Nardi, D., Patel-Schneider, P.F. (eds.): The Description Logic Handbook. Cambridge University Press (2003) 4. Bienvenu, M., Lutz, C., Wolter, F.: Query containment in description logics recon- sidered. In: Proc. of KR (2012) 5. Boulos, J., Dalvi, N.N., Mandhani, B., Mathur, S., Ré, C., Suciu, D.: MYSTIQ: a system for finding more answers by using probabilities. In: Proc. of SIGMOD. 891–893 (2005) 6. Calvanese, D., Giacomo, G.D., Lembo, D., Lenzerini, M., Rosati, R.: Tractable reasoning and efficient query answering in description logics: The DL-Lite family. J. Autom. Reasoning 39(3), 385–429 (2007) 7. Dalvi, N.N., Ré, C., Suciu, D.: Probabilistic databases: diamonds in the dirt. Com- mun. ACM 52(7), 86–94 (2009) 8. Dalvi, N.N., Schnaitter, K., Suciu, D.: Computing query probability with incidence algebras. In: Proc. of PODS. 203–214. ACM (2010) 9. Dalvi, N.N., Suciu, D.: Efficient query evaluation on probabilistic databases. VLDB J. 16(4), 523–544 (2007) 10. Dalvi, N.N., Suciu, D.: The dichotomy of probabilistic inference for unions of con- junctive queries. J. ACM 59(6), 30 (2012) 11. Finger, M., Wassermann, R., Cozman, F.G.: Satisfiability in EL with sets of prob- abilistic ABoxes. Proc. of DL. CEUR-WS, Vol. 745 (2011) 12. Fuhr, N., Rölleke, T.: A probabilistic relational algebra for the integration of infor- mation retrieval and database systems. ACM Trans. Inf. Syst. 15(1), 32–66 (1997) 13. Furche, T., Gottlob, G., Grasso, G., Gunes, O., Guo, X., Kravchenko, A., Orsi, G., Schallhart, C., Sellers, A.J., Wang, C.: Diadem: domain-centric, intelligent, automated data extraction methodology. In Proc. of WWW. 267–270. ACM (2012) 14. Gottlob, G., Lukasiewicz, T., Simari, G.I.: CQ answering in probabilistic datalog+/− ontologies. In Proc. of RR. Vol. 6902 of LNCS, 77–92. Springer (2011) 15. Green, T.J., Tannen, V.: Models for incomplete and probabilistic information. IEEE Data Engineering Bulletin 29(1), 17–24 (2006) 16. Gupta, R., Sarawagi, S.: Creating probabilistic databases from information extrac- tion models. In Proc. of VLDB. 965–976. ACM (2006) 17. Halpern, J.Y.: An analysis of first-order logics of probability. Artif. Intell. 46(3), 311–350 (1990) 18. Imielinski, T., Jr., W.L.: Incomplete information in relational databases. J. of the ACM 31(4), 761–791 (1984) 19. Jung, J.C., Lutz, C.: Ontology-based access to probabilistic data with OWL QL. In: Proc. of ISWC. pp. 182–197. Springer (2012) 20. Jerrum, M., Valiant, L.G., Vazirani, V.V.: Random generation of combinatorial structures from a uniform distribution. Theor. Comput. Sci. 43, 169–188 (1986) 21. Karger, D.R.: A randomized fully polynomial time approximation scheme for the all-terminal network reliability problem. SIAM J. Comput. 29(2), 492–514 (1999) 22. Karp, R.M., Luby, M.: Monte-carlo algorithms for enumeration and reliability prob- lems. In Proc. of FoCS. 56–64. IEEE Computer Society (1983) Ontology-Based Access to Probabilistic Data 13 23. Kontchakov, R., Lutz, C., Toman, D., Wolter, F., Zakharyaschev, M.: The com- bined approach to query answering in DL-Lite. In Proc. of KR. AAAI Press (2010) 24. Laender, A.H.F., Ribeiro-Neto, B.A., da Silva, A.S., Teixeira, J.S.: A brief survey of web data extraction tools. SIGMOD Record 31(2), 84–93 (2002) 25. Lukasiewicz, T., Straccia, U.: Managing uncertainty and vagueness in description logics for the semantic web. J. Web Sem. 6(4), 291–308 (2008) 26. Lutz, C., Schröder, L.: Probabilistic description logics for subjective uncertainty. In Proc. of KR. AAAI Press (2010) 27. Lutz, C., Wolter, F.: Non-uniform data complexity of query answering in descrip- tion logics. In: Proc. of KR. AAAI Press (2012) 28. Raedt, L.D., Kimmig, A., Toivonen, H.: Problog: a probabilistic prolog and its application in link discovery. In Proc. of IJCAI. 2468–2473. AAAI Press (2007) 29. Rossman, B.: Homomorphism preservation theorems. J. ACM. 55(3). 1–54 (2008). 30. Sarma, A.D., Benjelloun, O., Halevy, A.Y., Widom, J.: Working models for uncer- tain data. In: Proc. of ICDE. IEEE Computer Society (2006) 31. Straccia, U.: Top-k retrieval for ontology mediated access to relational databases. In: Information Sciences. 108, 1–23 (2012). 32. Suciu, D., Olteanu, D., Ré, C., Koch, C.: Probabilistic Databases. Synthesis Lec- tures on Data Management, Morgan & Claypool Publishers (2011) 33. Valiant, L.G.: The complexity of enumeration and reliability problems. SIAM J. Comput. 8(3), 410–421 (1979) 34. Vazirani, V.V.: Approximation algorithms. Springer (2001) 35. Widom, J.: Trio: A system for integrated management of data, accuracy, and lin- eage. In Proc. of CIDR. 262–276 (2005) 36. Zenklusen, R., Laumanns, M.: High-confidence estimation of small s-t reliabilities in directed acyclic networks. Networks 57(4), 376–388 (2011)