Probabilistic Ontological Data Exchange with Bayesian Networks

Department of Computer Science, Otto-von-Guericke Universität Magdeburg, Germany; Department of Computer Science, University of Oxford, UK; Dept. of Comp. Sci. and Eng., Univ. Nacional del Sur and CONICET, Argentina

We study the problem of exchanging probabilistic data between ontology-based probabilistic databases. The probabilities of the probabilistic source databases are compactly encoded via Boolean formulas whose variables adhere to the dependencies imposed by a Bayesian network; such formulas are closely related to the management of provenance. For the ontologies and the ontology mappings, we consider different kinds of existential rules from the Datalog+/– family. We provide a complete picture of the computational complexity of the problem of deciding whether there exists a probabilistic (universal) solution for a given probabilistic source database relative to a (probabilistic) ontological data exchange problem. We also analyze the complexity of answering UCQs (unions of conjunctive queries) in this framework.

Introduction

Large volumes of uncertain data are best modeled, stored, and processed in probabilistic databases [ 22 ]. Enriching databases with terminological knowledge encoded in ontologies has recently gained increasing importance in the form of ontology-based data access (OBDA) [ 21 ]. A crucial problem in OBDA is to integrate and exchange knowledge. Not only in the context of OBDA, but also in the area of the Semantic Web, there are distributed ontologies that we may have to map and integrate to enable query answering over them. Here, apart from the uncertainty attached to source databases, there may also be uncertainty regarding the ontology mappings establishing the proper correspondence between items in the source ontology and items in the target ontology. This especially happens when the mappings are created automatically.

Data exchange [ 11 ] is an important theoretical framework used for studying data-interoperability tasks that require data to be transferred from existing databases to a target database that comes with its own (independently created) schema and schema constraints. The expressivity of the data exchange framework goes beyond the classical data integration framework [ 17 ]. For the translation, schema mappings are used, which are declarative specifications that describe the relationship between two database schemas. In classical data exchange, we have a source database, a target database, a deterministic mapping, and deterministic target dependencies. Recently, a framework for probabilistic data exchange [ 10 ] has been proposed, in which the classical data exchange framework based on weakly acyclic existential rules is extended to a probabilistic source database and a probabilistic source-to-target mapping.

In this paper, we study an expressive extension of the probabilistic data exchange framework in [ 10 ], where the source and the target are ontological knowledge bases, each consisting of a probabilistic database and a deterministic ontology describing terminological knowledge about the data stored in the database. The two ontologies and the mapping between them are expressed via existential rules. Our extension of the data exchange framework is strongly related to exchanging data between incomplete databases, as proposed in [ 3 ], which considers an incomplete deterministic source database in the data exchange problem. However, in that work, the databases are deterministic, and the mappings and the target database constraints are full existential rules only. In our complexity analysis in this paper, we consider a host of different classes of existential rules, including some subclasses of full existential rules. In addition, our source is a probabilistic database relative to an underlying ontology.

Our work in this paper is also related to the recently proposed knowledge base exchange framework [ 2, 1 ], which allows knowledge to be exchanged between deterministic $\text{DL-Lite}_{RDFS}$ and $\text{DL-Lite}_{R}$ ontologies. In this paper, besides considering probabilistic source databases, we are also using more expressive ontology languages, since already linear existential rules from the Datalog+/– family are strictly more expressive than the description logics (DLs) $\text{DL-Lite}_{X}$ of the DL-Lite family [ 9 ] as well as their extensions with $n$-ary relations $\text{DLR-Lite}_{X}$. Guarded existential rules are sufficiently expressive to model the tractable DL $\mathcal{EL}$ [ 4, 5 ] (and $\mathcal{ELI}^f$ [ 16 ]). Note that existential rules are also known as tuple-generating dependencies (TGDs) and Datalog+/– rules [ 7 ]. The main contributions of this paper are summarized as follows.

We introduce deterministic and probabilistic ontological data exchange problems, where probabilistic knowledge is exchanged between two Bayesian network-based probabilistic databases relative to their underlying deterministic ontologies, and the deterministic and probabilistic mapping between the two ontologies is defined via deterministic and probabilistic existential mapping rules, respectively.

We provide an in-depth analysis of the data and combined complexity of deciding the existence of probabilistic (universal) solutions and obtain a (fairly) complete picture of the data complexity, general combined complexity, bounded-arity (ba) combined, and fixed-program combined (fp) complexity for the main sublanguages of the Datalog+/– family. We also delineate some tractable special cases, and provide complexity results for exact UCQ (union of conjunctive queries) answering.

For the complexity analysis, we consider a compact encoding of probabilistic source databases and mappings, which is used in the area of both incomplete and probabilistic databases, and also known as data provenance or data lineage [ 14, 12, 13, 22 ]. Here, we consider data provenance for probabilistic data that is structured according to an underlying Bayesian network.

Preliminaries

We assume infinite sets of constants $\mathbf{C}$, (labeled) nulls $\mathbf{N}$, and regular variables $\mathbf{V}$. A term $t$ is a constant, null, or variable. An atom has the form $p(t_1, \ldots, t_n)$, where $p$ is an $n$-ary predicate, and $t_1, \ldots, t_n$ are terms. Conjunctions of atoms are often identified with the sets of their atoms. An instance $I$ is a (possibly infinite) set of atoms $p(\mathbf{t})$, where $\mathbf{t}$ is a tuple of constants and nulls. A database $D$ is a finite instance that contains only constants. A homomorphism is a substitution $h : \mathbf{C} \cup \mathbf{N} \cup \mathbf{V} \to \mathbf{C} \cup \mathbf{N} \cup \mathbf{V}$ that is the identity on $\mathbf{C}$. We assume familiarity with conjunctive queries (CQs). The answer to a CQ $q$ over an instance $I$ is denoted $q(I)$. A Boolean CQ (BCQ) $q$ evaluates to true over $I$, denoted $I \models q$, if $q(I) \neq \emptyset$.

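A tuple-generating dependency (TGD) $\sigma$ is a first-order formula $\forall \mathbf{X}\, \varphi(\mathbf{X}) \to \exists \mathbf{Y}\, p(\mathbf{X}, \mathbf{Y})$, where $\mathbf{X} \cup \mathbf{Y} \subseteq \mathbf{V}$, $\varphi(\mathbf{X})$ is a conjunction of atoms, and $p(\mathbf{X}, \mathbf{Y})$ is an atom. We call $\varphi(\mathbf{X})$ the body of $\sigma$, denoted $\mathit{body}(\sigma)$, and $p(\mathbf{X}, \mathbf{Y})$ the head of $\sigma$, denoted $\mathit{head}(\sigma)$. We consider only TGDs with a single atom in the head, but our results can be extended to TGDs with a conjunction of atoms in the head. An instance $I$ satisfies $\sigma$, written $I \models \sigma$, if the following holds: whenever there exists a homomorphism $h$ such that $h(\varphi(\mathbf{X})) \subseteq I$, then there exists $h' \supseteq h|_{\mathbf{X}}$, where $h|_{\mathbf{X}}$ is the restriction of $h$ to $\mathbf{X}$, such that $h'(p(\mathbf{X}, \mathbf{Y})) \in I$. A negative constraint (NC) $\nu$ is a first-order formula $\forall \mathbf{X}\, \varphi(\mathbf{X}) \to \bot$, where $\mathbf{X} \subseteq \mathbf{V}$, $\varphi(\mathbf{X})$ is a conjunction of atoms, called the body of $\nu$, denoted $\mathit{body}(\nu)$, and $\bot$ denotes the truth constant false. An instance $I$ satisfies $\nu$, denoted $I \models \nu$, if there is no homomorphism $h$ such that $h(\varphi(\mathbf{X})) \subseteq I$. Given a set $\Sigma$ of TGDs and NCs, $I$ satisfies $\Sigma$, denoted $I \models \Sigma$, if $I$ satisfies each TGD and NC of $\Sigma$. For brevity, we omit the universal quantifiers in front of TGDs and NCs.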
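The satisfaction conditions above can be sketched for finite instances as follows. This is a minimal illustration, not part of the formal development: atoms are represented as `(predicate, arguments)` pairs, and the convention that variables are strings starting with `?` is an assumption made only for this sketch. The rules and facts are taken from Example 1 in Section 3.

```python
def is_var(term):
    # Assumed convention: variables are strings that start with "?".
    return isinstance(term, str) and term.startswith("?")


def homomorphisms(atoms, instance, partial=None):
    """Yield each substitution h extending `partial` with h(atoms) contained in `instance`."""
    h0 = dict(partial or {})
    if not atoms:
        yield h0
        return
    pred, args = atoms[0]
    for fact_pred, fact_args in instance:
        if fact_pred != pred or len(fact_args) != len(args):
            continue
        h = dict(h0)
        matched = True
        for term, value in zip(args, fact_args):
            if is_var(term):
                # Bind the variable, or check consistency with an earlier binding.
                if h.setdefault(term, value) != value:
                    matched = False
                    break
            elif term != value:
                matched = False
                break
        if matched:
            yield from homomorphisms(atoms[1:], instance, h)


def satisfies_tgd(instance, body, head):
    """Every match h of the body must extend to some h' mapping the head into the instance."""
    return all(
        next(homomorphisms([head], instance, h), None) is not None
        for h in homomorphisms(body, instance)
    )


def satisfies_nc(instance, body):
    """A negative constraint holds if its body has no match at all."""
    return next(homomorphisms(body, instance), None) is None


# The source-ontology rules of Example 1 and some of its facts.
sigma_s = ([("Publication", ("?X", "?Y", "?Z"))], ("ResearchArea", ("?X", "?Y")))
nu_s = [("Researcher", ("?X", "?Y")), ("ResearchArea", ("?X", "?Y"))]
base = {
    ("Researcher", ("Alice", "UnivOx")),
    ("Researcher", ("Paul", "UnivOx")),
    ("Publication", ("Alice", "ML", "JMLR")),
    ("Publication", ("Paul", "DB", "TODS")),
}
derived = {("ResearchArea", ("Alice", "ML")), ("ResearchArea", ("Paul", "DB"))}

print(satisfies_tgd(base, *sigma_s))            # base facts alone lack the derived atoms
print(satisfies_tgd(base | derived, *sigma_s))  # with the derived facts added
print(satisfies_nc(base | derived, nu_s))
```

The sketch enumerates matches by backtracking, which is exponential in the size of the rule body; it is meant to make the definitions concrete, not to be efficient.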

Given a database $D$ and a set $\Sigma$ of TGDs and NCs, the answers that we consider are those that are true in all models of $D$ and $\Sigma$. Formally, the models of $D$ and $\Sigma$, denoted $\mathit{mods}(D, \Sigma)$, is the set of instances $\{I \mid I \supseteq D \text{ and } I \models \Sigma\}$. The answer to a CQ $q$ relative to $D$ and $\Sigma$ is defined as the set of tuples $\mathit{ans}(q, D, \Sigma) = \bigcap_{I \in \mathit{mods}(D, \Sigma)} \{\mathbf{t} \mid \mathbf{t} \in q(I)\}$. The answer to a BCQ $q$ is true, denoted $D \cup \Sigma \models q$, if $\mathit{ans}(q, D, \Sigma) \neq \emptyset$. The problem of CQ answering is defined as follows: given a database $D$, a set $\Sigma$ of TGDs and NCs, a CQ $q$, and a tuple of constants $\mathbf{t}$, decide whether $\mathbf{t} \in \mathit{ans}(q, D, \Sigma)$. Following Vardi’s taxonomy [ 23 ], the combined complexity of BCQ answering is calculated by considering all the components, i.e., the database, the set of dependencies, and the query, as part of the input. The bounded-arity combined complexity (or simply ba-combined complexity) is calculated by assuming that the arity of the underlying schema is bounded by an integer constant. Notice that in the context of description logics (DLs), whenever we refer to the combined complexity, we in fact refer to the ba-combined complexity, since, by definition, the arity of the underlying schema is at most two. The fixed-program combined complexity (or simply fp-combined complexity) is calculated by considering the set of TGDs and NCs as fixed.

Ontological Data Exchange

In this section, we define the notions of deterministic and probabilistic ontological data exchange. The source (resp., target) of the deterministic/probabilistic ontological data exchange problems that we consider in this paper is a probabilistic database (resp., probabilistic instance), each relative to a deterministic ontology. Here, a probabilistic database (resp., probabilistic instance) over a schema $\mathbf{S}$ is a probability space $\mathit{Pr} = (\mathcal{I}, \mu)$ such that $\mathcal{I}$ is the set of all (possibly infinitely many) databases (resp., instances) over $\mathbf{S}$, and $\mu : \mathcal{I} \to [0, 1]$ is a function that satisfies $\sum_{I \in \mathcal{I}} \mu(I) = 1$.

Ontological data exchange formalizes data exchange from a probabilistic database relative to a source ontology $\Sigma_s$ (consisting of TGDs and NCs) over a schema $\mathbf{S}$ to a probabilistic target instance $\mathit{Pr}_t$ relative to a target ontology $\Sigma_t$ (consisting of a set of TGDs and NCs) over a schema $\mathbf{T}$ via a (source-to-target) mapping (also consisting of a set of TGDs and NCs). More specifically, an ontological data exchange (ODE) problem $\mathcal{M} = (\mathbf{S}, \mathbf{T}, \Sigma_s, \Sigma_t, \Sigma_{st})$ consists of (i) a source schema $\mathbf{S}$, (ii) a target schema $\mathbf{T}$ disjoint from $\mathbf{S}$, (iii) a finite set $\Sigma_s$ of TGDs and NCs over $\mathbf{S}$ (called source ontology), (iv) a finite set $\Sigma_t$ of TGDs and NCs over $\mathbf{T}$ (called target ontology), and (v) a finite set $\Sigma_{st}$ of TGDs and NCs $\sigma$ over $\mathbf{S} \cup \mathbf{T}$ (called (source-to-target) mapping) such that $\mathit{body}(\sigma)$ and $\mathit{head}(\sigma)$ are defined over $\mathbf{S} \cup \mathbf{T}$ and $\mathbf{T}$, respectively.

Ontological data exchange with deterministic databases is based on defining a target instance $J$ over $\mathbf{T}$ as being a solution for a deterministic source database $I$ over $\mathbf{S}$ relative to an ODE problem $\mathcal{M} = (\mathbf{S}, \mathbf{T}, \Sigma_s, \Sigma_t, \Sigma_{st})$, if $(I \cup J) \models \Sigma_s \cup \Sigma_t \cup \Sigma_{st}$. We denote by $\mathit{Sol}_{\mathcal{M}}$ the set of all such pairs $(I, J)$. Among the possible deterministic solutions $J$ to a deterministic source database $I$ relative to $\mathcal{M}$ in $\mathit{Sol}_{\mathcal{M}}$, we prefer universal solutions, which are the most general ones carrying only the necessary information for data exchange, i.e., those that transfer only the source database along with the relevant implicit derivations via $\Sigma_s$ to the target ontology. A universal solution can be homomorphically mapped to all other solutions leaving the constants unchanged. Hence, a deterministic target instance $J$ over $\mathbf{T}$ is a universal solution for a deterministic source database $I$ over $\mathbf{S}$ relative to $\mathcal{M}$, if (i) $J$ is a solution, and (ii) for each solution $J'$ for $I$ relative to $\mathcal{M}$, there is a homomorphism $h : J \to J'$. We denote by $\mathit{USol}_{\mathcal{M}}$ ($\subseteq \mathit{Sol}_{\mathcal{M}}$) the set of all pairs $(I, J)$ of deterministic source databases $I$ and target instances $J$ such that $J$ is a universal solution for $I$ relative to $\mathcal{M}$.
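Condition (ii) can be illustrated by a brute-force search for a homomorphism between two finite target instances. The following is a sketch under stated assumptions: nulls are represented as strings starting with `_:` (a convention chosen here, not taken from the paper), and the function tests only one given pair $J$, $J'$, whereas universality quantifies over all solutions $J'$. The two instances correspond to $J_4$ and $J_7$ of Table 1.

```python
def is_null(term):
    # Assumed convention: labeled nulls are strings that start with "_:".
    return isinstance(term, str) and term.startswith("_:")


def find_homomorphism(source, target):
    """Return a mapping h on the nulls of `source` with h(source) contained in `target`, or None.

    Constants are left unchanged, as required of homomorphisms.
    """
    atoms = sorted(source)
    target = set(target)

    def extend(i, h):
        if i == len(atoms):
            return h
        pred, args = atoms[i]
        for t_pred, t_args in target:
            if t_pred != pred or len(t_args) != len(args):
                continue
            h2 = dict(h)
            # Nulls may be bound (consistently); constants must match exactly.
            if all((h2.setdefault(a, b) == b) if is_null(a) else a == b
                   for a, b in zip(args, t_args)):
                result = extend(i + 1, h2)
                if result is not None:
                    return result
        return None

    return extend(0, {})


udb = ("UResearchArea", ("UnivOx", "_:N3", "DB"))
ldb = ("Lecturer", ("DB", "_:N6"))
uml = ("UResearchArea", ("UnivOx", "_:N1", "ML"))
uai = ("UResearchArea", ("UnivOx", "_:N2", "AI"))
lml = ("Lecturer", ("ML", "_:N4"))
lai = ("Lecturer", ("AI", "_:N5"))

J4 = {udb, ldb}
J7 = {uml, uai, udb, lml, lai, ldb}

print(find_homomorphism(J4, J7))  # J4 maps into J7
print(find_homomorphism(J7, J4))  # J7 mentions ML and AI, which J4 does not
```

The asymmetry between the two calls reflects the intuition that a more general instance maps into a more specific one, but not conversely.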

When considering probabilistic databases and instances, a joint probability space $\mathit{Pr}$ over the solution relation $\mathit{Sol}_{\mathcal{M}}$ and the universal solution relation $\mathit{USol}_{\mathcal{M}}$ must exist. More specifically, a probabilistic target instance $\mathit{Pr}_t = (\mathcal{J}, \mu_t)$ is a probabilistic solution (resp., probabilistic universal solution) for a probabilistic source database $\mathit{Pr}_s = (\mathcal{I}, \mu_s)$ relative to an ODE problem $\mathcal{M} = (\mathbf{S}, \mathbf{T}, \Sigma_s, \Sigma_t, \Sigma_{st})$, if there exists a probability space $\mathit{Pr} = (\mathcal{I} \times \mathcal{J}, \mu)$ such that (i) the left and right marginals of $\mathit{Pr}$ are $\mathit{Pr}_s$ and $\mathit{Pr}_t$, respectively, i.e., (i.a) $\mu_s(I) = \sum_{J \in \mathcal{J}} \mu(I, J)$ for all $I \in \mathcal{I}$, (i.b) $\mu_t(J) = \sum_{I \in \mathcal{I}} \mu(I, J)$ for all $J \in \mathcal{J}$; and (ii) $\mu(I, J) = 0$ for all $(I, J) \notin \mathit{Sol}_{\mathcal{M}}$ (resp., $(I, J) \notin \mathit{USol}_{\mathcal{M}}$). Intuitively, all non-solutions $(I, J)$ have probability zero, and the existence of a solution does not exclude that some source databases with probability zero have no corresponding target instance.
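Conditions (i.a), (i.b), and (ii) are straightforward to check once a candidate joint distribution is given. The sketch below does exactly that and nothing more: it does not search for a joint distribution, and the relation `allowed` is simply assumed to list pairs that belong to $\mathit{Sol}_{\mathcal{M}}$ (or $\mathit{USol}_{\mathcal{M}}$). The labels and probabilities are those of $\mathit{Pr}_s$ and $\mathit{Pr}_{t_1}$ in Table 1; the particular pairing of source databases with target instances is an assumption made for illustration.

```python
from math import isclose


def marginal(joint, axis):
    """Sum the joint probabilities over the other coordinate."""
    out = {}
    for pair, p in joint.items():
        out[pair[axis]] = out.get(pair[axis], 0.0) + p
    return out


def same_distribution(a, b, tol):
    return all(isclose(a.get(k, 0.0), b.get(k, 0.0), abs_tol=tol) for k in set(a) | set(b))


def witnesses_solution(joint, mu_s, mu_t, allowed, tol=1e-9):
    """Check (i.a), (i.b) and (ii) for a candidate joint distribution."""
    # (ii): no probability mass outside the (universal) solution relation.
    if any(p > tol and pair not in allowed for pair, p in joint.items()):
        return False
    # (i.a) and (i.b): the marginals must coincide with mu_s and mu_t.
    return (same_distribution(marginal(joint, 0), mu_s, tol)
            and same_distribution(marginal(joint, 1), mu_t, tol))


mu_s = {"I1": 0.5, "I2": 0.2, "I3": 0.15, "I4": 0.075, "I5": 0.075}
mu_t1 = {"J1": 0.5, "J2": 0.2, "J3": 0.15, "J4": 0.15}

# ASSUMPTION: these pairs are taken to be in the solution relation.
allowed = {("I1", "J1"), ("I2", "J2"), ("I3", "J3"), ("I4", "J4"), ("I5", "J4")}
joint = {("I1", "J1"): 0.5, ("I2", "J2"): 0.2, ("I3", "J3"): 0.15,
         ("I4", "J4"): 0.075, ("I5", "J4"): 0.075}

print(witnesses_solution(joint, mu_s, mu_t1, allowed))
```

Note that $I_4$ and $I_5$ both contribute to the probability of $J_4$, which is why the right marginal assigns it $0.075 + 0.075 = 0.15$.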

Example 1. An ontological data exchange (ODE) problem $\mathcal{M} = (\mathbf{S}, \mathbf{T}, \Sigma_s, \Sigma_t, \Sigma_{st})$ is given by the source schema $\mathbf{S}$ = {Researcher/2, ResearchArea/2, Publication/3} (the number after each predicate denotes its arity), the target schema $\mathbf{T}$ = {UResearchArea/3, Lecturer/2}, the source ontology $\Sigma_s = \{\sigma_s, \nu_s\}$, the target ontology $\Sigma_t = \{\sigma_t, \nu_t\}$, and the mapping $\Sigma_{st} = \{\sigma_{st}, \nu_m\}$, where:

$\sigma_s$ : Publication(X, Y, Z) $\to$ ResearchArea(X, Y);
$\nu_s$ : Researcher(X, Y) $\wedge$ ResearchArea(X, Y) $\to \bot$;
$\sigma_t$ : UResearchArea(U, D, T) $\to \exists Z$ Lecturer(T, Z);
$\nu_t$ : Lecturer(X, Y) $\wedge$ Lecturer(Y, X) $\to \bot$;

Table 1. Probabilistic source database $\mathit{Pr}_s$ and probabilistic target instances $\mathit{Pr}_{t_1}$ and $\mathit{Pr}_{t_2}$ ($N_1, \ldots, N_6$ are nulls); both are probabilistic solutions, but only $\mathit{Pr}_{t_1}$ is universal.

Possible source database facts
  ra      Researcher(Alice, UnivOx)
  rp      Researcher(Paul, UnivOx)
  paml    Publication(Alice, ML, JMLR)
  padb    Publication(Alice, DB, TODS)
  ppdb    Publication(Paul, DB, TODS)
  ppai    Publication(Paul, AI, AIJ)

Derived source database facts
  aaml    ResearchArea(Alice, ML)
  aadb    ResearchArea(Alice, DB)
  apdb    ResearchArea(Paul, DB)
  apai    ResearchArea(Paul, AI)

Probabilistic source database $\mathit{Pr}_s = (\mathcal{I}, \mu_s)$
  I1 = {ra, rp, paml, ppdb, aaml, apdb}    0.5
  I2 = {ra, rp, paml, ppai, aaml, apai}    0.2
  I3 = {ra, rp, padb, ppai, aadb, apai}    0.15
  I4 = {ra, rp, padb, ppdb, aadb, apdb}    0.075
  I5 = {ra, padb, aadb}                    0.075

Possible target instance facts
  uml     UResearchArea(UnivOx, N1, ML)
  uai     UResearchArea(UnivOx, N2, AI)
  udb     UResearchArea(UnivOx, N3, DB)
  lml     Lecturer(ML, N4)
  lai     Lecturer(AI, N5)
  ldb     Lecturer(DB, N6)

Probabilistic target instance $\mathit{Pr}_{t_1} = (\mathcal{J}_1, \mu_{t_1})$
  J1 = {uml, udb, lml, ldb}    0.5
  J2 = {uml, uai, lml, lai}    0.2
  J3 = {uai, udb, lai, ldb}    0.15
  J4 = {udb, ldb}              0.15

Probabilistic target instance $\mathit{Pr}_{t_2} = (\mathcal{J}_2, \mu_{t_2})$
  J5 = {uml, udb, lml, ldb}              0.55
  J6 = {uml, uai, lml, lai}              0.1
  J7 = {uml, uai, udb, lml, lai, ldb}    0.35

Given the probabilistic source database in Table 1, two probabilistic instances $\mathit{Pr}_{t_1} = (\mathcal{J}_1, \mu_{t_1})$ and $\mathit{Pr}_{t_2} = (\mathcal{J}_2, \mu_{t_2})$ that are probabilistic solutions are shown in Table 1. Note that only $\mathit{Pr}_{t_1}$ is also a probabilistic universal solution. Note also that Figures 1 and 2 show the probability spaces over $\mathit{Pr}_{t_1}$ and $\mathit{Pr}_{t_2}$, respectively.

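Query answering in ontological data exchange is performed over the target ontology and is generalized from deterministic data exchange. A union of conjunctive queries (or UCQ) has the form $q(\mathbf{X}) = \bigvee_{i=1}^{k} \exists \mathbf{Y}_i\, \Phi_i(\mathbf{X}, \mathbf{Y}_i, \mathbf{C}_i)$, where each $\exists \mathbf{Y}_i\, \Phi_i(\mathbf{X}, \mathbf{Y}_i, \mathbf{C}_i)$ with $i \in \{1, \ldots, k\}$ is a CQ with exactly the variables $\mathbf{X}$ and $\mathbf{Y}_i$, and the constants $\mathbf{C}_i$. Given an ODE problem $\mathcal{M} = (\mathbf{S}, \mathbf{T}, \Sigma_s, \Sigma_t, \Sigma_{st})$, a probabilistic source database $\mathit{Pr}_s = (\mathcal{I}, \mu_s)$, a UCQ $q(\mathbf{X}) = \bigvee_{i=1}^{k} \exists \mathbf{Y}_i\, \Phi_i(\mathbf{X}, \mathbf{Y}_i, \mathbf{C}_i)$, and a tuple $\mathbf{t}$ (a ground instance of $\mathbf{X}$ in $q$) over $\mathbf{C}$, the confidence of $\mathbf{t}$ relative to $q$, denoted $\mathit{conf}_q(\mathbf{t})$, in $\mathit{Pr}_s$ relative to $\mathcal{M}$ is the infimum of $\mathit{Pr}_t(q(\mathbf{t}))$ subject to all probabilistic solutions $\mathit{Pr}_t$ for $\mathit{Pr}_s$ relative to $\mathcal{M}$. Here, $\mathit{Pr}_t(q(\mathbf{t}))$ for $\mathit{Pr}_t = (\mathcal{J}, \mu_t)$ is the sum of all $\mu_t(J)$ such that $q(\mathbf{t})$ evaluates to true in the instance $J \in \mathcal{J}$ (i.e., some BCQ $\exists \mathbf{Y}_i\, \Phi_i(\mathbf{t}, \mathbf{Y}_i, \mathbf{C}_i)$ with $i \in \{1, \ldots, k\}$ evaluates to true in $J$).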

Example 2. Consider again the setting of Example 1, and let $q$ be a UCQ of a student who wants to know whether she can study either machine learning or artificial intelligence at the University of Oxford: $q() = \exists X, Z\, (\text{Lecturer}(AI, X) \wedge \text{UResearchArea}(UnivOx, Z, AI)) \vee \exists X, Z\, (\text{Lecturer}(ML, X) \wedge \text{UResearchArea}(UnivOx, Z, ML))$. Then, $q$ yields the probabilities 0.85 and 1 on $\mathit{Pr}_{t_1}$ and $\mathit{Pr}_{t_2}$, respectively.
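The two probabilities in Example 2 can be recomputed directly from Table 1. The sketch below evaluates the Boolean UCQ of Example 2 on each target instance and sums the probabilities of the instances in which it holds. It computes $\mathit{Pr}_t(q)$ for one given probabilistic instance only; the confidence $\mathit{conf}_q$, being an infimum over all probabilistic solutions, is not computed here. The `_:` prefix for nulls is a naming choice made for this sketch.

```python
uml = ("UResearchArea", ("UnivOx", "_:N1", "ML"))
uai = ("UResearchArea", ("UnivOx", "_:N2", "AI"))
udb = ("UResearchArea", ("UnivOx", "_:N3", "DB"))
lml = ("Lecturer", ("ML", "_:N4"))
lai = ("Lecturer", ("AI", "_:N5"))
ldb = ("Lecturer", ("DB", "_:N6"))

# Probabilistic target instances of Table 1 as lists of (instance, probability).
Pr_t1 = [({uml, udb, lml, ldb}, 0.5), ({uml, uai, lml, lai}, 0.2),
         ({uai, udb, lai, ldb}, 0.15), ({udb, ldb}, 0.15)]
Pr_t2 = [({uml, udb, lml, ldb}, 0.55), ({uml, uai, lml, lai}, 0.1),
         ({uml, uai, udb, lml, lai, ldb}, 0.35)]


def q(instance):
    """The Boolean UCQ of Example 2: a lecturer and a research area for AI, or for ML."""
    def disjunct(topic):
        # X and Z are independent existential variables, so the two atoms
        # can be checked separately.
        has_lecturer = any(p == "Lecturer" and a[0] == topic for p, a in instance)
        has_area = any(p == "UResearchArea" and a[0] == "UnivOx" and a[2] == topic
                       for p, a in instance)
        return has_lecturer and has_area
    return disjunct("AI") or disjunct("ML")


def query_probability(prob_instance, query):
    """Sum of the probabilities of all instances in which the query is true."""
    return sum(p for instance, p in prob_instance if query(instance))


print(round(query_probability(Pr_t1, q), 10))
print(round(query_probability(Pr_t2, q), 10))
```

On $\mathit{Pr}_{t_1}$, only $J_4$ falsifies the query, which accounts for the missing probability mass of 0.15.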

Probabilistic Ontological Data Exchange

Probabilistic ontological data exchange extends deterministic ontological data exchange by turning the deterministic source-to-target mapping into a probabilistic source-to-target mapping, i.e., we have a probability distribution over the set of all subsets of $\Sigma_{st}$. More specifically, a probabilistic ontological data exchange (PODE) problem $\mathcal{M} = (\mathbf{S}, \mathbf{T}, \Sigma_s, \Sigma_t, \Sigma_{st}, \mu_{st})$ consists of (i) a source schema $\mathbf{S}$, (ii) a target schema $\mathbf{T}$ disjoint from $\mathbf{S}$, (iii) a finite set $\Sigma_s$ of TGDs and NCs over $\mathbf{S}$ (called source ontology), (iv) a finite set $\Sigma_t$ of TGDs and NCs over $\mathbf{T}$ (called target ontology), (v) a finite set $\Sigma_{st}$ of TGDs and NCs over $\mathbf{S} \cup \mathbf{T}$, and (vi) a function $\mu_{st} : 2^{\Sigma_{st}} \to [0, 1]$ such that $\sum_{\Sigma' \subseteq \Sigma_{st}} \mu_{st}(\Sigma') = 1$ (called probabilistic (source-to-target) mapping).

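A probabilistic target instance $\mathit{Pr}_t = (\mathcal{J}, \mu_t)$ is a probabilistic solution (resp., probabilistic universal solution) for a probabilistic source database $\mathit{Pr}_s = (\mathcal{I}, \mu_s)$ relative to a PODE problem $\mathcal{M} = (\mathbf{S}, \mathbf{T}, \Sigma_s, \Sigma_t, \Sigma_{st}, \mu_{st})$, if there exists a probability space $\mathit{Pr} = (\mathcal{I} \times \mathcal{J} \times 2^{\Sigma_{st}}, \mu)$ such that: (i) the three marginals of $\mu$ are $\mu_s$, $\mu_t$, and $\mu_{st}$, i.e., (i.a) $\mu_s(I) = \sum_{J \in \mathcal{J},\, \Sigma' \subseteq \Sigma_{st}} \mu(I, J, \Sigma')$ for all $I \in \mathcal{I}$, (i.b) $\mu_t(J) = \sum_{I \in \mathcal{I},\, \Sigma' \subseteq \Sigma_{st}} \mu(I, J, \Sigma')$ for all $J \in \mathcal{J}$, and (i.c) $\mu_{st}(\Sigma') = \sum_{I \in \mathcal{I},\, J \in \mathcal{J}} \mu(I, J, \Sigma')$ for all $\Sigma' \subseteq \Sigma_{st}$; and (ii) $\mu(I, J, \Sigma') = 0$ for all $(I, J) \notin \mathit{Sol}_{(\mathbf{S}, \mathbf{T}, \Sigma')}$ (resp., $(I, J) \notin \mathit{USol}_{(\mathbf{S}, \mathbf{T}, \Sigma')}$).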

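Using probabilistic (universal) solutions for probabilistic source databases relative to PODE problems, the semantics of UCQs is lifted to PODE problems as follows. Given a PODE problem $\mathcal{M} = (\mathbf{S}, \mathbf{T}, \Sigma_s, \Sigma_t, \Sigma_{st}, \mu_{st})$, a probabilistic source database $\mathit{Pr}_s = (\mathcal{I}, \mu_s)$, a UCQ $q(\mathbf{X}) = \bigvee_{i=1}^{k} \exists \mathbf{Y}_i\, \Phi_i(\mathbf{X}, \mathbf{Y}_i, \mathbf{C}_i)$, and a tuple $\mathbf{t}$ (a ground instance of $\mathbf{X}$ in $q$) over $\mathbf{C}$, the confidence of $\mathbf{t}$ relative to $q$, denoted $\mathit{conf}_q(\mathbf{t})$, in $\mathit{Pr}_s$ relative to $\mathcal{M}$ is the infimum of $\mathit{Pr}_t(q(\mathbf{t}))$ subject to all probabilistic solutions $\mathit{Pr}_t$ for $\mathit{Pr}_s$ relative to $\mathcal{M}$. Here, $\mathit{Pr}_t(q(\mathbf{t}))$ for $\mathit{Pr}_t = (\mathcal{J}, \mu_t)$ is the sum of all $\mu_t(J)$ such that $q(\mathbf{t})$ evaluates to true in the instance $J \in \mathcal{J}$.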

Compact Encoding

We use a compact encoding of both probabilistic databases and probabilistic mappings, which is based on annotating facts, TGDs, and NCs by probabilistic events in a Bayesian network, rather than explicitly specifying the whole probability space.

We first define annotations and annotated atoms. Let $e_1, \ldots, e_n$ be $n \geq 1$ elementary events. A world $w$ is a conjunction $\ell_1 \wedge \cdots \wedge \ell_n$, where each $\ell_i$, $i \in \{1, \ldots, n\}$, is either the elementary event $e_i$ or its negation $\neg e_i$. An annotation $\lambda$ is any Boolean combination of elementary events (i.e., all elementary events are annotations, and if $\lambda_1$ and $\lambda_2$ are annotations, then also $\neg \lambda_1$ and $\lambda_1 \wedge \lambda_2$). An annotated atom has the form $a : \lambda$, where $a$ is an atom, and $\lambda$ is an annotation.

Table 2. Annotated possible source database facts.

        Possible source database facts     Annotation
  ra    Researcher(Alice, UnivOx)          true
  rp    Researcher(Paul, UnivOx)           $e_1 \vee e_2 \vee e_3 \vee e_4$
  paml  Publication(Alice, ML, JMLR)       $e_1 \vee e_2$
  padb  Publication(Alice, DB, TODS)       $\neg e_1 \wedge \neg e_2$
  ppdb  Publication(Paul, DB, TODS)        $e_1 \vee (\neg e_2 \wedge \neg e_3 \wedge e_4)$
  ppai  Publication(Paul, AI, AIJ)         $(\neg e_1 \wedge e_2) \vee (\neg e_1 \wedge e_3)$

The compact encoding of probabilistic databases can then be defined as follows. Note that this encoding also underlies our complexity analysis in Section 4. A set $\mathbf{A}$ of annotated atoms along with a probability $\mu(w) \in [0, 1]$ for every world $w$ compactly encodes a probabilistic database $\mathit{Pr} = (\mathcal{I}, \mu)$ whenever: (i) the probability $\mu$ of every annotation $\lambda$ is the sum of the probabilities of all worlds in which $\lambda$ is true, and (ii) the probability $\mu$ of every subset-maximal database $\{a_1, \ldots, a_m\} \in \mathcal{I}$ ⁴ such that $\{a_1 : \lambda_1, \ldots, a_m : \lambda_m\} \subseteq \mathbf{A}$ for some annotations $\lambda_1, \ldots, \lambda_m$ is the probability of $\lambda_1 \wedge \cdots \wedge \lambda_m$ (and the probability of every other database in $\mathcal{I}$ is 0).

We assume that the probability distributions for the underlying events are given by a Bayesian network, which is usually used for compactly specifying a joint probability space, encoding also a certain causal structure between the variables. The following example in Tables 2 and 3 illustrates the compact encoding of probabilistic source databases via Boolean annotations relative to an underlying Bayesian network.
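To make the encoding concrete, the following sketch decodes the annotations of Table 2 by enumerating all worlds over $e_1, \ldots, e_4$. Two assumptions are made here and are not taken from the paper: first, the encoding is read world by world, so that each world determines the database consisting of exactly those facts whose annotations it satisfies; second, since the Bayesian network of Table 3 is not reproduced in this text, the events are treated as independent with hypothetical marginals, chosen so that the decoded distribution agrees with the probabilities listed for $\mathit{Pr}_s$ in Table 1. Only the annotated base facts are decoded; the derived ResearchArea facts are omitted.

```python
from collections import defaultdict
from itertools import product

EVENTS = ["e1", "e2", "e3", "e4"]

# ASSUMPTION: hypothetical marginals of independent events (a Bayesian
# network without edges); the network used in the paper may differ.
P_TRUE = {"e1": 0.5, "e2": 0.4, "e3": 0.5, "e4": 0.5}

# Annotations of Table 2, written as Boolean functions of a world w.
ANNOTATED = {
    "ra": lambda w: True,
    "rp": lambda w: w["e1"] or w["e2"] or w["e3"] or w["e4"],
    "paml": lambda w: w["e1"] or w["e2"],
    "padb": lambda w: not w["e1"] and not w["e2"],
    "ppdb": lambda w: w["e1"] or (not w["e2"] and not w["e3"] and w["e4"]),
    "ppai": lambda w: (not w["e1"] and w["e2"]) or (not w["e1"] and w["e3"]),
}


def world_probability(w):
    """Probability of a world under the independence assumption above."""
    p = 1.0
    for e in EVENTS:
        p *= P_TRUE[e] if w[e] else 1.0 - P_TRUE[e]
    return p


def decode(annotated):
    """Map each database (set of fact labels) to the total probability of its worlds."""
    dist = defaultdict(float)
    for values in product([True, False], repeat=len(EVENTS)):
        w = dict(zip(EVENTS, values))
        database = frozenset(f for f, holds in annotated.items() if holds(w))
        dist[database] += world_probability(w)
    return dict(dist)


for database, p in sorted(decode(ANNOTATED).items(), key=lambda item: -item[1]):
    print(sorted(database), round(p, 3))
```

The enumeration visits $2^n$ worlds for $n$ elementary events, so it only serves as an illustration; the point of the compact encoding is precisely to avoid materializing this space.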

If the mapping is probabilistic as well, then we use two disjoint sets of elementary events, one for encoding the probabilistic source database and the other one for the mapping. In this way, the probabilistic source database is independent from the probabilistic mapping. We now define the compact encoding of probabilistic mappings. An annotated TGD (resp., NC) has the form $\sigma : \lambda$, where $\sigma$ is a TGD (resp., NC), and $\lambda$ is an annotation. A set $\Sigma$ of annotated TGDs and NCs $\sigma : \lambda$ with $\sigma \in \Sigma_{st}$ along with a probability $\mu(w) \in [0, 1]$ for every world $w$ compactly encodes a probabilistic mapping $\mu_{st} : 2^{\Sigma_{st}} \to [0, 1]$ whenever (i) the probability $\mu$ of every annotation $\lambda$ is the sum of the probabilities of all worlds in which $\lambda$ is true, and (ii) the probability $\mu_{st}$ of every subset-maximal $\{\sigma_1, \ldots, \sigma_k\} \subseteq \Sigma_{st}$ such that $\{\sigma_1 : \lambda_1, \ldots, \sigma_k : \lambda_k\} \subseteq \Sigma$ for some annotations $\lambda_1, \ldots, \lambda_k$ is the probability of $\lambda_1 \wedge \cdots \wedge \lambda_k$ (and the probability $\mu_{st}$ of every other subset of $\Sigma_{st}$ is 0).

⁴ That is, we do not consider subsets of the databases here.

Computational Problems

We consider the following computational problems:

Existence of a solution (resp., universal solution): Given an ODE or a PODE problem $\mathcal{M}$ and a probabilistic source database $\mathit{Pr}_s$, decide whether there exists a probabilistic (resp., probabilistic universal) solution for $\mathit{Pr}_s$ relative to $\mathcal{M}$.

Answering UCQs: Given an ODE or a PODE problem $\mathcal{M}$, a probabilistic source database $\mathit{Pr}_s$, a UCQ $q(\mathbf{X})$, and a tuple $\mathbf{t}$ over $\mathbf{C}$, compute $\mathit{conf}_q(\mathbf{t})$ in $\mathit{Pr}_s$ w.r.t. $\mathcal{M}$.

Computational Complexity

We now analyze the computational complexity of deciding the existence of a (universal) probabilistic solution for deterministic and probabilistic ontological data exchange problems. We also delineate some tractable special cases, and we provide some complexity results for exact UCQ answering for ODE and PODE problems.

We assume some elementary background in complexity theory [ 15, 20 ]. We now briefly recall the complexity classes that we encounter in our complexity results. The complexity classes PSPACE (resp., P, EXP, 2EXP) contain all decision problems that can be solved in polynomial space (resp., polynomial, exponential, double exponential time) on a deterministic Turing machine, while the complexity classes NP and NEXP contain all decision problems that can be solved in polynomial and exponential time on a nondeterministic Turing machine, respectively; coNP and coNEXP are their complementary classes, where “Yes” and “No” instances are interchanged. The complexity class $\text{AC}^0$ is the class of all languages that are decidable by uniform families of Boolean circuits of polynomial size and constant depth. The inclusion relationships among the above (decision) complexity classes (all currently believed to be strict) are as follows:

$\text{AC}^0 \subseteq \text{P} \subseteq \text{NP}, \text{coNP} \subseteq \text{PSPACE} \subseteq \text{EXP} \subseteq \text{NEXP}, \text{coNEXP} \subseteq \text{2EXP}$

The (function) complexity class #P is the set of all functions that are computable by a polynomial-time nondeterministic Turing machine whose output for a given input string $I$ is the number of accepting computations for $I$.

Decidability Paradigms

The main (syntactic) conditions on TGDs that guarantee the decidability of CQ answering are guardedness [ 6 ], stickiness [ 8 ], and acyclicity. Each one of these conditions has its “weak” counterpart: weak guardedness [ 6 ], weak stickiness [ 8 ], and weak acyclicity [ 11 ], respectively.

Figure 3. Complexity of BCQ answering with TGDs and NCs [ 18 ].

               Data      Comb.    ba-comb.   fp-comb.
  L, LF, AF    in AC0    PSPACE   NP         NP
  G            P         2EXP     EXP        NP
  WG           EXP       2EXP     EXP        EXP
  S, SF        in AC0    EXP      NP         NP
  F, GF        P         EXP      NP         NP
  A            in AC0    NEXP     NEXP       NP
  WS, WA       P         2EXP     2EXP       NP

Figure 4. Complexity of deciding the existence of a probabilistic (universal) solution (all entries are completeness results).

               Data      Comb.     ba-comb.   fp-comb.
  L, LF, AF    coNP      PSPACE    coNP       coNP
  G            coNP      2EXP      EXP        coNP
  WG           EXP       2EXP      EXP        EXP
  S, SF        coNP      EXP       coNP       coNP
  F, GF        coNP      EXP       coNP       coNP
  A            coNP      coNEXP    coNEXP     coNP
  WS, WA       coNP      2EXP      2EXP       coNP

A TGD $\sigma$ is guarded if there exists an atom in its body that contains (or “guards”) all the body variables of $\sigma$. The class of guarded TGDs, denoted G, is defined as the family of all possible sets of guarded TGDs. A key subclass of guarded TGDs are the so-called linear TGDs with just one body atom (which is automatically a guard), and the corresponding class is denoted L. Weakly guarded TGDs extend guarded TGDs by requiring only “harmful” body variables to appear in the guard, and the associated class is denoted WG. It is easy to verify that L $\subset$ G $\subset$ WG.
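The syntactic conditions for linearity and guardedness can be tested directly on a rule body. The sketch below assumes a representation of atoms as `(predicate, arguments)` pairs with variables written as strings starting with `?`; this representation is chosen for illustration only. Weak guardedness is not covered, since it additionally requires identifying the “harmful” variables.

```python
def variables(atom):
    """Variables of an atom; by assumption they are the arguments starting with '?'."""
    _, args = atom
    return {a for a in args if a.startswith("?")}


def is_linear(body):
    """A TGD is linear if its body consists of a single atom."""
    return len(body) == 1


def is_guarded(body):
    """A TGD is guarded if some body atom contains all body variables."""
    all_vars = set().union(*(variables(a) for a in body))
    return any(variables(a) == all_vars for a in body)


# Body of the TGD sigma_t from Example 1: a single atom, hence linear and guarded.
body_sigma_t = [("UResearchArea", ("?U", "?D", "?T"))]
# A two-atom body in which each atom contains both variables: guarded, not linear.
body_two_atoms = [("Researcher", ("?X", "?Y")), ("ResearchArea", ("?X", "?Y"))]
# A hypothetical join body without a guard: no atom contains ?X, ?Y and ?Z together.
body_join = [("R", ("?X", "?Y")), ("S", ("?Y", "?Z"))]

for body in (body_sigma_t, body_two_atoms, body_join):
    print(is_linear(body), is_guarded(body))
```

A set of TGDs belongs to L (resp., G) when every rule in it passes the respective test.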

Stickiness is inherently different from guardedness, and its central property can be described as follows: variables that appear more than once in a body (i.e., join variables) are always propagated (or “stick”) to the inferred atoms. A set of TGDs that enjoys the above property is called sticky, and the corresponding class is denoted S. Weak stickiness is a relaxation of stickiness where only “harmful” variables are taken into account. A set of TGDs which enjoys weak stickiness is weakly sticky, and the associated class is denoted WS. Observe that S $\subset$ WS.

A set $\Sigma$ of TGDs is acyclic if its predicate graph is acyclic, and the underlying class is denoted A. In fact, an acyclic set of TGDs can be seen as a nonrecursive set of TGDs. We say $\Sigma$ is weakly acyclic if its dependency graph enjoys a certain acyclicity condition, which actually guarantees the existence of a finite canonical model; the associated class is denoted WA. Clearly, A $\subset$ WA.
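Acyclicity can be tested with a standard depth-first search. The text does not spell out the predicate graph, so the sketch below assumes the usual construction: one node per predicate and an edge from every body predicate of a TGD to its head predicate. Weak acyclicity, which is defined on the dependency graph over predicate positions, is not covered.

```python
def predicate_graph(tgds):
    """Assumed construction: edge from each body predicate to the head predicate."""
    graph = {}
    for body, head in tgds:
        graph.setdefault(head[0], set())
        for pred, _ in body:
            graph.setdefault(pred, set()).add(head[0])
    return graph


def is_acyclic(tgds):
    """Depth-first search with three colors; a grey successor closes a cycle."""
    graph = predicate_graph(tgds)
    WHITE, GREY, BLACK = 0, 1, 2
    color = {p: WHITE for p in graph}

    def visit(p):
        color[p] = GREY
        for succ in graph[p]:
            if color[succ] == GREY:
                return False
            if color[succ] == WHITE and not visit(succ):
                return False
        color[p] = BLACK
        return True

    return all(color[p] != WHITE or visit(p) for p in graph)


# The two TGDs sigma_s and sigma_t of Example 1.
sigma_s = ([("Publication", ("?X", "?Y", "?Z"))], ("ResearchArea", ("?X", "?Y")))
sigma_t = ([("UResearchArea", ("?U", "?D", "?T"))], ("Lecturer", ("?T", "?Z")))
# A hypothetical extra rule that feeds Lecturer back into UResearchArea.
back_edge = ([("Lecturer", ("?T", "?Z"))], ("UResearchArea", ("?U", "?D", "?T")))

print(is_acyclic([sigma_s, sigma_t]))
print(is_acyclic([sigma_s, sigma_t, back_edge]))
```

The recursion depth is bounded by the number of predicates, which is small for hand-written rule sets; an iterative traversal would be preferable for very large schemas.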

Another key fragment of TGDs, which deserves our attention, are the so-called full TGDs, i.e., TGDs without existentially quantified variables, and the corresponding class is denoted F. If we further assume that full TGDs enjoy linearity, guardedness, stickiness, or acyclicity, then we obtain the classes LF, GF, SF, and AF, respectively.

Overview of Complexity Results

Our complexity results for deciding the existence of a probabilistic (universal) solution for both ODE and PODE problems with annotations over events relative to an underlying Bayesian network are summarized in Fig. 4 for all classes of existential rules discussed above in the data, combined, ba-combined, and fp-combined complexity (all entries are completeness results). For L, LF, AF, S, SF, and A in the data complexity, we obtain tractability when the underlying Bayesian network is a polytree. For all other cases, hardness holds even when the underlying Bayesian network is a polytree. Finally, for all classes of existential rules discussed above except for WG, answering UCQs for both ODE and PODE problems is in #P in the data complexity.

The first result shows that deciding whether there exists a probabilistic (or probabilistic universal) solution for a probabilistic source database relative to an ODE problem is complete for $C$ (resp., co$C$), if BCQ answering for the involved sets of TGDs and NCs is complete for a deterministic (resp., nondeterministic) complexity class $C \supseteq$ PSPACE (resp., $C \supseteq$ NP), and hardness holds even for ground atomic BCQs. As a corollary, by the complexity of BCQ answering with TGDs and NCs in Figure 3 [ 18 ], we immediately obtain the complexity results shown in Figure 4 for deciding the existence of a probabilistic (universal) solution (in deterministic ontological data exchange) in the combined, ba-combined, and fp-combined complexity, and for the class WG of TGDs and NCs in the data complexity. The hardness results hold even when the underlying Bayesian network is a polytree.

Theorem 1. Given a probabilistic source database $\mathit{Pr}_s$ relative to a source ontology $\Sigma_s$ and an ODE problem $\mathcal{M} = (\mathbf{S}, \mathbf{T}, \Sigma_s, \Sigma_t, \Sigma_{st})$ such that $\Sigma_s \cup \Sigma_t \cup \Sigma_{st}$ belongs to a class of TGDs and NCs for which BCQ answering is complete for a deterministic (resp., nondeterministic) complexity class $C \supseteq$ PSPACE (resp., $C \supseteq$ NP), and hardness holds even for ground atomic BCQs, deciding the existence of a probabilistic (universal) solution for $\mathit{Pr}_s$ relative to $\Sigma_s$ and $\mathcal{M}$ is complete for $C$ (resp., co$C$). Hardness holds even when the underlying Bayesian network is a polytree.

The following result shows that deciding whether there exists a probabilistic (universal) solution for a probabilistic source database relative to an ODE problem is complete for coNP in the data complexity, for all classes of sets of TGDs and NCs considered in this paper, except for WG. Hardness for coNP for the classes G, F, GF, WS, and WA holds even when the underlying Bayesian network is a polytree.

Theorem 2. Given a probabilistic source database $\mathit{Pr}_s$ relative to a source ontology $\Sigma_s$ and an ODE problem $\mathcal{M} = (\mathbf{S}, \mathbf{T}, \Sigma_s, \Sigma_t, \Sigma_{st})$ such that $\Sigma_s \cup \Sigma_t \cup \Sigma_{st}$ belongs to a class among L, LF, AF, G, S, SF, F, GF, A, WS, and WA, deciding whether there exists a probabilistic (or probabilistic universal) solution for $\mathit{Pr}_s$ relative to $\Sigma_s$ and $\mathcal{M}$ is coNP-complete in the data complexity. Hardness for coNP for the classes G, F, GF, WS, and WA holds even when the underlying Bayesian network is a polytree.

The following result shows that deciding whether there exists a probabilistic (or probabilistic universal) solution for a probabilistic source database relative to an ODE problem is in P in the data complexity, if BCQ answering for the involved sets of TGDs and NCs is first-order rewritable as a Boolean UCQ, and the underlying Bayesian network is a polytree. As a corollary, by the complexity of BCQ answering with TGDs and NCs, deciding the existence of a solution is in P for the classes L, LF, AF, S, SF, and A in the data complexity, if the underlying Bayesian network is a polytree.

Theorem 3. Given a probabilistic source database $\mathit{Pr}_s$ relative to a source ontology $\Sigma_s$, with a polytree as Bayesian network, and an ODE problem $\mathcal{M} = (\mathbf{S}, \mathbf{T}, \Sigma_s, \Sigma_t, \Sigma_{st})$ such that $\Sigma_s \cup \Sigma_t \cup \Sigma_{st}$ belongs to a class of TGDs and NCs for which BCQ answering is first-order rewritable as a Boolean UCQ, deciding whether there exists a probabilistic (universal) solution for $\mathit{Pr}_s$ relative to $\Sigma_s$ and $\mathcal{M}$ is in P in the data complexity.

Finally, the following theorem shows that answering UCQs for probabilistic source databases relative to an ODE problem is complete for #P in the data complexity for all above classes of existential rules except for WG.

Theorem 4. Given (i) an ODE problem $\mathcal{M} = (\mathbf{S}, \mathbf{T}, \Sigma_s, \Sigma_t, \Sigma_{st})$ such that $\Sigma_s \cup \Sigma_{st} \cup \Sigma_t$ belongs to a class among L, LF, AF, G, S, SF, F, GF, A, WS, and WA, (ii) a probabilistic source database $\mathit{Pr}_s$ relative to $\Sigma_s$ such that there exists a solution for $\mathit{Pr}_s$ relative to $\mathcal{M}$, (iii) a UCQ $Q = q(\mathbf{X})$ over $\mathbf{T}$, and (iv) a tuple $\mathbf{a}$, computing $\mathit{conf}_Q(\mathbf{a})$ is #P-complete in the data complexity.

All the results of Section 4.3 in Theorems 1 to 4 carry over to the case of probabilistic ontological data exchange. Clearly, the hardness results carry over immediately, since deterministic ontological data exchange is a special case of probabilistic ontological data exchange. As for the membership results, we additionally consider the worlds for the probabilistic mapping, which are iterated through in the data complexity and guessed in the combined, the ba-combined, and the fp-combined complexity.

Summary and Outlook

We have defined deterministic and probabilistic ontological data exchange problems, where probabilistic knowledge is exchanged between two ontologies. The two ontologies and the mapping between them are defined via existential rules, where the rules for the mapping are deterministic and probabilistic, respectively. We have given a precise analysis of the computational complexity of deciding the existence of a probabilistic (universal) solution for different classes of existential rules in both deterministic and probabilistic ontological data exchange. We also have delineated some tractable special cases, and we have provided some complexity results for exact UCQ answering.

An interesting topic for future research is to further explore the tractable cases of probabilistic solution existence and whether they can be extended, e.g., by slightly generalizing the type of the mapping rules. Another issue for future work is to further analyze the complexity of answering UCQs for different classes of existential rules in deterministic and probabilistic ontological data exchange.

Acknowledgments. This work was supported by an EU (FP7/2007-2013) Marie-Curie Intra-European Fellowship (“PRODIMA”), the UK EPSRC grant EP/J008346/1 (“PrOQAW”), the ERC grant 246858 (“DIADEM”), a Yahoo! Research Fellowship, and funds from Universidad Nacional del Sur and CONICET, Argentina. This paper is a short version of a paper that appeared in Proc. RuleML 2015 [ 19 ].

References

1. Arenas, M., Botoeva, E., Calvanese, D., Ryzhikov, V.: Exchanging OWL2 QL knowledge bases. In: Proc. IJCAI. pp. 703-710 (2013)

2. Arenas, M., Botoeva, E., Calvanese, D., Ryzhikov, V., Sherkhonov, E.: Exchanging description logic knowledge bases. In: Proc. KR. pp. 563-567 (2012)

3. Arenas, M., Pérez, J., Reutter, J.L.: Data exchange beyond complete data. J. ACM 60(4), 28:1-28:59 (2013)

4. Baader, F.: Least common subsumers and most specific concepts in a description logic with existential restrictions and terminological cycles. In: Proc. IJCAI. pp. 364-369 (2003)

5. Baader, F., Brandt, S., Lutz, C.: Pushing the EL envelope. In: Proc. IJCAI. pp. 364-369 (2005)

6. Calì, A., Gottlob, G., Kifer, M.: Taming the infinite chase: Query answering under expressive relational constraints. J. Artif. Intell. Res. 48, 115-174 (2013)

7. Calì, A., Gottlob, G., Lukasiewicz, T., Marnette, B., Pieris, A.: Datalog+/–: A family of logical knowledge representation and query languages for new applications. In: Proc. LICS. pp. 228-242 (2010)

8. Calì, A., Gottlob, G., Pieris, A.: Towards more expressive ontology languages: The query answering problem. Artif. Intell. 193, 87-128 (2012)

9. Calvanese, D., De Giacomo, G., Lembo, D., Lenzerini, M., Rosati, R.: Tractable reasoning and efficient query answering in description logics: The DL-Lite family. J. Autom. Reasoning 39(3), 385-429 (2007)

10. Fagin, R., Kimelfeld, B., Kolaitis, P.G.: Probabilistic data exchange. J. ACM 58(4), 15:1-15:55 (2011)

11. Fagin, R., Kolaitis, P.G., Miller, R.J., Popa, L.: Data exchange: Semantics and query answering. Theor. Comput. Sci. 336(1), 89-124 (2005)

12. Fuhr, N., Rölleke, T.: A probabilistic relational algebra for the integration of information retrieval and database systems. ACM Trans. Inf. Sys. 15(1), 32-66 (1997)

13. Green, T.J., Karvounarakis, G., Tannen, V.: Provenance semirings. In: Proc. PODS. pp. 31-40 (2007)

14. Imielinski, T., Lipski, Jr., W.: Incomplete information in relational databases. J. ACM 31(4), 761-791 (1984)

15. Johnson, D.S.: A catalog of complexity classes. In: van Leeuwen, J. (ed.) Handbook of Theoretical Computer Science, vol. A, chap. 2, pp. 67-161. MIT Press (1990)

16. Krisnadhi, A., Lutz, C.: Data complexity in the EL family of description logics. In: Proc. LPAR, LNCS, vol. 4790, pp. 333-347. Springer (2007)

17. Lenzerini, M.: Data integration: A theoretical perspective. In: Proc. PODS. pp. 233-246 (2002)

18. Lukasiewicz, T., Martinez, M.V., Pieris, A., Simari, G.I.: From classical to consistent query answering under existential rules. In: Proc. AAAI. pp. 1546-1552 (2015)

19. Lukasiewicz, T., Martinez, M.V., Predoiu, L., Simari, G.I.: Existential rules and Bayesian networks for probabilistic ontological data exchange. In: Proc. RuleML. LNCS, vol. 9202, pp. 294-310. Springer (2015)

20. Papadimitriou, C.H.: Computational Complexity. Addison-Wesley (1994)

21. Poggi, A., Lembo, D., Calvanese, D., De Giacomo, G., Lenzerini, M., Rosati, R.: Linking data to ontologies. J. Data Sem. 10, 133-173 (2008)

22. Suciu, D., Olteanu, D., Ré, C., Koch, C.: Probabilistic Databases. Morgan & Claypool (2011)

23. Vardi, M.Y.: The complexity of relational query languages (extended abstract). In: Proc. STOC. pp. 137-146 (1982)