Towards General Representability in Knowledge Exchange Marcelo Arenas1 , Jorge Pérez2 , and Emanuel Sallinger3 1 Pontificia Universidad Católica de Chile & University of Oxford 2 Department of Computer Science, Universidad de Chile 3 Vienna University of Technology 1 Introduction In data exchange, one is typically given a source database instance and a mapping between a source schema and a target schema. The goal then is to materialize a target database instance that corresponds to the source instance and the mapping. In this setting source data is explicitly given, that is, every fact that is true in it is explicitly mentioned in the source database instance. Knowledge bases typically consist of some explicit knowledge – like in database instances – and some implicit knowledge, usually given in the form of rules that specify how to derive knowledge not explicitly stored. In [5], Arenas et al. introduced the setting of knowledge exchange, where one is given a source knowledge base and a mapping between a source schema and a target schema, and the goal is to materialize a target knowledge base, that is, both explicit and implicit knowledge. While the question of what constitutes a “good” solution to a data exchange problem has been the topic of research in recent years [11], the question of what constitutes a “good” solution to a knowledge exchange problem is rather new. In particular, there are now two components, explicit and implicit knowledge, which play rather different roles in what we expect from them. In [5], two especially desirable properties of the target knowledge base were iden- tified. First, one generally wants to minimize explicit knowledge, thus generating as much knowledge by rules as possible. Second, in typical knowledge-based systems (such as those based on RDFS, OWL, or general description logics), the explicit knowledge changes frequently, but the implicit knowledge remains constant over a longer time. Hence, it is desirable to let the target implicit knowledge only depend on the implicit knowledge at the source and the mapping between source and target. Thus, knowledge exchange effectively becomes a two-stage process: 1. materialize target implicit knowledge, given only the source implicit knowledge and the source-to-target mapping; and 2. materialize target explicit knowledge w.r.t. the materialized implicit knowledge, which should be as small as possible. M. Arenas was supported by Fondecyt grant #1110287, J. Pérez was supported by Fondecyt grant #11110404, the research of E. Sallinger is supported by the Austrian Science Fund (FWF):P25207-N23. Here, we focus on the former, and thus say that such target implicit knowledge “rep- resents” the source implicit knowledge under a given mapping. Two such notions of “representation” were introduced recently: safety [5] and Q-representability [3, 4], where Q is a query language. In this work, we introduce the very broad notion of general representability that captures safety as a special case and lays the foundation to the study of representability in a broad setting, in particular with the goal of extending our work to Q-representability. To achieve this, general representability is not limited to knowledge exchange, but based on arbitrary schema mappings. As an essential tool for this study, we develop conditions on schema mapping languages under which notions of equivalence important to our investigation coincide. Organization. In Section 2, we define the necessary preliminaries. We start introducing general representability in Section 3. Then, in Section 4, we show how to capture safety. In Section 5, we introduce notions of equivalence towards capturing Q-representability. We conclude in Section 6. 2 Preliminaries We assume some familiarity with database theory and data exchange (cf., e.g., [1, 6]). A schema S is a set of relation symbols, each having a fixed arity. An instance I ∈ Inst(S) assigns to each relation symbol R a relation RI of the corresponding arity. We denote by dom(I) the set of all constants that occur in a relation of I. We assume instances to be finite and, thus, dom(I) is finite. A schema mapping M (or just mapping) from a source schema S1 to a target schema S2 is a subset of Inst(S1 ) × Inst(S2 ).4 The composition of two mappings, symbolized by ◦, is defined as the composition of binary relations [9]. M and M0 are equivalent, written M ≡ M0 , if they are equal as binary relations (M = M0 ). Besides, we write M  M0 for M ⊆ M0 . We later use the notion of a mapping M “based on” some logical formalism. By that we mean that there exists a set Σ of dependencies from that formalism s.t. M = {(I, J) | (I, J)  Σ}. As these specific formalisms are not the focus of this paper, we refer to e.g. [5] for formal details and definitions. We say that Σ is over a schema S if all relation symbols occurring in Σ are from S. For a set of instances I ⊆ Inst(S), the set of solutions of I under M is SolM (I) = {J | (I, J) ∈ M and I ∈ I}. A single instance I ∈ Inst(S1 ) is treated as a singleton set {I} wherever applicable, e.g. SolM (I) = SolM ({I}). In this paper we deal with knowledge bases in a very general way treating them as mappings over a single schema. A general knowledge base over a schema S is represented simply as a mapping M from S to S such that, intuitively, if I represents the explicit knowledge, then SolM (I) represents all possible completions of I where the implicit knowledge has been made explicit. For keeping schema specifications brief, a subscripted mapping Mi is always from schema Si to Si , and a mapping Mij is always from schema Si to Sj . 4 To adapt to the knowledge exchange setting [5], we do not explicitly consider labeled nulls in target instances, however the concept is preserved as constants only occurring in a target instance are essentially treated as labeled nulls in data exchange [6]. 2 3 General Representability Intuitively, given mappings M1 , M2 and M12 , we say that M2 “represents” M1 under M12 if the following equivalence is fulfilled: M1 ◦ M12 ≡ M12 ◦ M2 . (†) In the context of knowledge exchange, M1 and M2 can be seen as the mappings responsible for generating the completions discussed in the preliminaries, and M12 as the mapping used for transferring data. The condition (†) is very strong. In particular, M12 must already transfer enough of the source data for such an M2 to exist. A very critical point is that we require “logical” equivalence, i.e, for every source instance all the solutions must coincide. It can be demonstrated in simple examples that this requirement is too strong to be widely applicable. Less strict notions of representability have been studied in recent research, namely safety [5] in the context of full tuple-generating dependencies [8], and Q-representability – where Q stands for a class of queries – in the context of description logics [3, 4]. We therefore introduce general representability, a very weak notion of representability which allows us to study relaxed notions of representability in a very general setting. Definition 1. Let ≡χ be an equivalence relation on schema mappings. If there exist mappings M?1 and M?12 such that M1 ◦ M12 ≡χ M?1 ◦ M?12 ◦ M2 , (‡) then M2 is called a general χ-representation of M1 under M12 witnessed by the pair of mappings (M?1 , M?12 ). The intuition behind this definition is the following: M2 may not be the “ideal” repre- sentation as defined by equation (†). Though, if through keeping small parts of M1 and by slightly extending M12 we can show equivalence, this may be good enough for a given application. In particular, it might not be possible to completely remove M1 , but we may have to keep a “part” of M1 , namely M?1 . Second, it might not be possible to use M12 , but we may need a slightly “extended” mapping M?12 . Of course, to study representability in a general setting, we do not formally constrain in this definition how these “parts” and “extensions” must look like. The third way we may adapt our equation to some application is by using an appropriate equivalence relation ≡χ , e.g., by using the so-called data exchange equivalence [7] if our application is related to data exchange. Given this general definition, one has a lot of freedom in choosing M?1 and M?12 . In particular, by choosing M?1 = M1 and M?12 = M12 , even the identity mapping M2 = Id2 = {(I, I) | I ∈ Inst(S2 )} is a general representation of M1 under M12 . On the other end of the spectrum, there is a clear “best” kind of general representation M2 , namely if one can choose M?1 = Id1 = {(I, I) | I ∈ Inst(S1 )} and M?12 = M12 . If these two equalities hold, we speak of a strong χ-representation. We thus described what one could consider “trivial” and “best” representations, however there is much room between those two extremes. For example, it is reasonable to require that while M?1 ◦ M?12 should be able to be more restrictive than M12 , it should not be more restrictive than M1 ◦ M12 together. Altogether, one can thus require that: 3 M1 ◦ M12  M?1 ◦ M?12 and M?1 ◦ M?12  M12 . When these conditions hold, we talk about a weak χ-representation. We intend to use weak χ-representations for characterizing Q-representability in future work. The intuition behind the notion of weak representability is to offer an intermediate amount of freedom between strong representations and unconstrained general representa- tions. In particular, the witness (M?1 , M?12 ) has to be at least as restrictive as the witness of a strong representation (M?1 ◦ M?12  M12 ) and at most as restrictive as the witness of a “trivial” representation (M1 ◦ M12  M?1 ◦ M?12 ). These conditions allow to rule out problematic behaviour that may occur in a general setting, but not in the context of e.g. Q-representability. An orthogonal restriction is to allow either only a source-to-target witness M?12 or only a source witness M?1 in Definition 1, giving rise to the notions of source-to-target χ-representation (M?1 = Id1 ) and source χ-representation (M?12 = M12 ). Formally, M2 is a source-to-target χ-representation of M1 under M12 if M2 is a general χ- representation witnessed by the pair of mappings (Id1 , M?12 ). Source χ-representation can be defined similarly by requiring (M?1 , M12 ) as the witness in Definition 1. Note that while for some situations M?1 or M?12 suffice, the interplay between them can become important, in particular when putting restrictions on the classes of mappings used. In the next section we use source-to-target χ-representations to characterize safety. 4 Characterizing Safety In this section, we show that the notion of safety, which was developed in [5] in the context of knowledge exchange, can be characterized by the notion of general represen- tation introduced in the previous section. In what follows, we assume that a knowledge base K = (I, Σ) over a schema S is given by an instance I of S, called the explicit knowledge, and a set Σ of dependencies over S, called the implicit knowledge. Moreover, we associate to Σ a mapping Map(Σ) = {(I, J) | I, J ∈ Inst(S), I ⊆ J and J  Σ}. Finally, we define the set of models of K = (I, Σ) as Mod(K) = SolMap(Σ) (I). In order to define the notion of safety proposed in [5], we need to introduce some terminology. Given a set I of instances of a schema S, we define Min(I) as the set of minimal instances in I under the subset relation, that is, as {I ∈ I | there is no I 0 ∈ I such that I 0 ( I}. Moreover, given a mapping M from a schema S1 to a schema S2 and knowledge bases K1 = (I1 , Σ1 ), K2 = (I2 , Σ2 ) over S1 and S2 , respectively, K2 is called a minimal knowledge-base solution for K1 under M if Min(Mod(K2 )) = Min(SolM (Mod(K1 ))) [5]. With these concepts, we can formally define the notion of safety. Definition 2 ([5]). Let M be a schema mapping from a schema S1 to a schema S2 , and Σ1 , Σ2 be sets of dependencies over S1 and S2 , respectively. Then Σ2 is said to be safe for Σ1 under M, if for every instance I1 of S1 , there exists an instance I2 of S2 such that (I2 , Σ2 ) is a minimal knowledge-base solution for (I1 , Σ1 ) under M. In order to capture safety by using the notion of general representability, we first need to choose the appropriate notion of equivalence to be used in equation (‡). It is clear that the notion of minimality plays a major role in Definition 2, thus equivalence with 4 respect to minimal solutions is a natural choice. More precisely, for I ∈ Inst(S1 ), we define MinSolM (I) = Min(SolM (I)) and say that M12 ≡MinSol M012 if for every I ∈ Inst(S1 ), it holds that MinSolM12 (I) = MinSolM012 (I). It is important to notice that the original definition of safety given in [5] only considered full tuple-generating dependencies (that is, Σ1 and M in Definition 2 were required to be based on these dependencies). A consequence of this restriction is that only mappings that are absolutely consistent, in the sense defined in [2], were considered in the original definition of safety. So when capturing this notion we focus on this class of mappings. More precisely, a mapping M from a schema S1 to a schema S2 is said to be absolutely consistent if for every I ∈ Inst(S1 ), it holds that SolM (I) 6= ∅. Then we have that: Proposition 1. Let Σ1 , Σ2 be sets of dependencies over schemas S1 and S2 , respec- tively, and assume that M12 is a mapping such that Map(Σ1 ) ◦ M12 is absolutely consistent. Then Σ2 is safe for Σ1 under M12 iff Map(Σ2 ) is a source-to-target MinSol- representation of Map(Σ1 ) under M12 . It follows that safety as defined in [5] can be captured by source-to-target MinSol- representability, as M1 ◦ M12 is absolutely consistent in the definition given in [5] (since M1 and M12 are required to be based on full tuple-generating dependencies in [5]). Interestingly, the previous result also holds for mappings based on tuple-generating dependencies with terminating chase. 5 On the Equivalence of Mappings To capture notions like safety or Q-representability within the framework of general representability, one of the key ingredients is choosing the right notion of equivalence of mappings. In this section, we present the first steps towards the development of some appropriate notions of equivalence of mappings, which connect minimal solutions (used to characterize safety in Proposition 1) to notions of equivalence relevant for Q-representability. A key role in this connection is played by a generalization of the widely used notion of universal solution [6], which is defined next. Let I be a set of instances, I1 , I2 ∈ I, and let C be a set of constants. A homomorphism from I1 to I2 that is constant on C is a function h : dom(I1 ) → dom(I2 ) such that for all c ∈ (C ∩ dom(I1 )) we have that h(c) = c, and for all (x1 , . . . , xn ) ∈ RI1 we have that (h(x1 ), . . . , h(xn )) ∈ RI2 . We write I1 C I2 if such an h exists. We say that I1 ≺C I2 if I1 C I2 and I2 6C I1 . Then let HomC (I) = {I ∈ I | there is no I 0 ∈ I such that I 0 ≺C I}. Universal solutions are defined as the minimum elements of a set I of solutions up to equivalence under homomorphisms, while the elements of HomC (I) are defined as the minimal elements of I up to equivalence under homomorphisms. Furthermore, given a pair I1 , I2 of instances of a schema S and a set C of constants, we say that I1 is a C-core of I2 if [10, 8]: (1) I1 is a sub-instance of I2 , (2) I2 C I1 , and (3) there is no homomorphism from I2 to a proper sub-instance of I1 that is the identity on C. Then let CoreC (I) = {I ∈ Inst(S) | I is a C-core of some I 0 ∈ I}. We are now ready to define the notions of equivalence of mappings studied in this section. For I ∈ Inst(S1 ) and mapping M from S1 , we define HomSolM (I) = 5 Homdom(I) (SolM (I)) and CoreHomSolM (I) = Coredom(I) (HomSolM (I))). More- over, M12 ≡CoreHomSol M012 if for all I ∈ Inst(S1 ) we have that CoreHomSolM12 (I) = CoreHomSolM012 (I). Next we introduce conditions on mappings that allow us to relate the newly introduced notion to the notion of MinSol we used before. Definition 3. Let M be a mapping from a schema S1 to a schema S2 . Then: M is domain-full if for all I ∈ Inst(S1 ) and J ∈ HomSolM (I), there exists J 0 ∈ HomSolM (I) such that J 0 ⊆ J and dom(J 0 ) ⊆ dom(I); M is founded if for all I ∈ Inst(S1 ) and J ∈ SolM (I), there exists J 0 ∈ HomSolM (I) such that J 0 dom(I) J; and M admits core-solutions if for all I ∈ Inst(S1 ) we have CoreHomSolM (I) ⊆ SolM (I). Proposition 2. Let M and M0 be mappings from a schema S1 to a schema S2 such that M is domain-full, founded, and admits core-solutions. Then it holds that M ≡MinSol M0 iff M ≡CoreHomSol M0 . So for safety as originally defined in [5], we thus know that the following holds: Corollary 1. Let Σ1 be a set of full tgds over a schema S1 , Σ2 a set of dependencies over a schema S2 and M12 a mapping from S1 to S2 based on full s-t tgds. Then Σ2 is safe for Σ1 under M12 iff Map(Σ2 ) is a source-to-target CoreHomSol-representation of Map(Σ1 ) under M12 . We have not yet developed a similar result for the notion of Q-representability. But towards that, we studied the connection between equivalence based on CoreHomSol and notions based on query languages relevant for Q-representability, that we introduce next. We assume that a k-ary query q over a schema S is a function that maps every instance I of S to a k-ary relation q(I) contained in dom(I)k . Moreover, given a set I of instances of S and a k-ary query T q over S, the set of certain answers of q over I, denoted by certq (I), is defined as I∈I q(I). Let Q be a class of queries, and M, M0 be mappings from a schema S1 to a schema S2 . In the spirit of [7], we say that M ≡Q M0 if for all q ∈ Q over S2 and I ∈ Inst(S1 ), it holds certq (SolM (I)) = certq (SolM0 (I)). A k-ary query q over a schema S is said to be closed under homomorphisms if for every pair of instances I1 , I2 of S, tuple (a1 , . . . , ak ) ∈ q(I1 ) and homomorphism h from I1 to I2 , it holds that (h(a1 ), . . . , h(ak )) ∈ q(I2 ). Let CHQ be the class of queries that are closed under homomorphisms, and UCQ be the class of unions of conjunctive queries. Again, to formally state the relations between notions of equivalence of mappings, we define some natural conditions and the necessary terminology. We use notation I1 ∼ =C I2 to indicate that there exists an isomorphism from I1 to I2 that is the identity on C, and we use notation I1 ↔C I2 to indicate that I1 C I2 and I2 C I1 . Besides, we use notation I1 ∼ = I2 as a shorthand for I1 ∼ =∅ I2 . Definition 4. Let M be a mapping from a schema S1 to a schema S2 . Then: M is closed under isomorphisms if for all I1 , I2 ∈ Inst(S1 ) and J1 , J2 ∈ Inst(S2 ), if J1 ∈ SolM (I1 ) and (I1 , J1 ) ∼ = (I2 , J2 ), then J2 ∈ SolM (I2 ); M is target domain-independent if for all I ∈ Inst(S1 ) and J1 , J2 ∈ HomSolM (I), it holds that dom(J1 ) ∩ dom(I) = dom(J2 ) ∩ dom(I); and M has a finite base if for every I ∈ Inst(S1 ), it holds that HomSolM (I) is finite up to ↔dom(I) . 6 Theorem 1. Let M, M0 be mappings from a schema S1 to a schema S2 , such that M, M0 are closed under isomorphisms, founded and target domain-independent. Then M ≡CHQ M0 iff M ≡CoreHomSol M0 As a corollary of the proof of Theorem 1, we have the following for ≡UCQ , the most studied notion of equivalence in the context of Q-representability: Corollary 2. Let M, M0 be mappings from a schema S1 to a schema S2 , such that M, M0 are closed under isomorphisms, are founded, are target domain-independent and have a finite base. Then M ≡UCQ M0 iff M ≡CoreHomSol M0 . 6 Concluding remarks In this paper, we introduced the framework of general representability that allows us to study notions of representability like safety and Q-representability. We intro- duced notions of equivalence suitable for capturing safety, and defined conditions under which these characterizations apply. In particular, we were able to capture safety as originally defined. We also started introducing the necessary machinery for studying Q-representability, with a focus on the UCQ and the more general CHQ case. As future work, we intend to capture Q-representability, and continue using the framework of general representability to develop notions of representation both for specific applications like knowledge exchange as well as in the general case. The study of the conditions and notions of equivalence we introduced will also be an interesting endeavor in its own right, as it allows us to find candidates for conditions which schema mapping formalisms should satisfy. References 1. Abiteboul, S., Hull, R., Vianu, V.: Foundations of Databases. Addison-Wesley (1995) 2. Amano, S., Libkin, L., Murlak, F.: XML schema mappings. In: PODS. pp. 33–42 (2009) 3. Arenas, M., Botoeva, E., Calvanese, D.: Knowledge base exchange. In: DL (2011) 4. Arenas, M., Botoeva, E., Calvanese, D., Ryzhikov, V., Sherkhonov, E.: Exchanging description logic knowledge bases. In: KR (2012) 5. Arenas, M., Pérez, J., Reutter, J.L.: Data exchange beyond complete data. In: PODS. pp. 83–94 (2011) 6. Fagin, R., Kolaitis, P.G., Miller, R.J., Popa, L.: Data exchange: semantics and query answering. Theor. Comput. Sci. 336(1), 89–124 (2005) 7. Fagin, R., Kolaitis, P.G., Nash, A., Popa, L.: Towards a theory of schema-mapping optimiza- tion. In: PODS. pp. 33–42 (2008) 8. Fagin, R., Kolaitis, P.G., Popa, L.: Data exchange: getting to the core. ACM Trans. Database Syst. 30(1), 174–210 (2005) 9. Fagin, R., Kolaitis, P.G., Popa, L., Tan, W.C.: Composing schema mappings: Second-order dependencies to the rescue. ACM Trans. Database Syst. 30(4), 994–1055 (2005) 10. Hell, P., Nesetril, J.: The core of a graph. Discrete Mathematics 109(1–3), 117–126 (1992) 11. Hernich, A., Schweikardt, N.: Logic and data exchange: Which solutions are ”good” solutions? In: LOFT. pp. 61–85 (2008) 7