Towards General Representability in
                         Knowledge Exchange

                   Marcelo Arenas1 , Jorge Pérez2 , and Emanuel Sallinger3
               1
                   Pontificia Universidad Católica de Chile & University of Oxford
                    2
                       Department of Computer Science, Universidad de Chile
                                3
                                   Vienna University of Technology


1     Introduction

In data exchange, one is typically given a source database instance and a mapping
between a source schema and a target schema. The goal then is to materialize a target
database instance that corresponds to the source instance and the mapping. In this setting
source data is explicitly given, that is, every fact that is true in it is explicitly mentioned
in the source database instance.
     Knowledge bases typically consist of some explicit knowledge – like in database
instances – and some implicit knowledge, usually given in the form of rules that specify
how to derive knowledge not explicitly stored. In [5], Arenas et al. introduced the setting
of knowledge exchange, where one is given a source knowledge base and a mapping
between a source schema and a target schema, and the goal is to materialize a target
knowledge base, that is, both explicit and implicit knowledge.
     While the question of what constitutes a “good” solution to a data exchange problem
has been the topic of research in recent years [11], the question of what constitutes a
“good” solution to a knowledge exchange problem is rather new. In particular, there are
now two components, explicit and implicit knowledge, which play rather different roles
in what we expect from them.
     In [5], two especially desirable properties of the target knowledge base were iden-
tified. First, one generally wants to minimize explicit knowledge, thus generating as
much knowledge by rules as possible. Second, in typical knowledge-based systems (such
as those based on RDFS, OWL, or general description logics), the explicit knowledge
changes frequently, but the implicit knowledge remains constant over a longer time.
Hence, it is desirable to let the target implicit knowledge only depend on the implicit
knowledge at the source and the mapping between source and target. Thus, knowledge
exchange effectively becomes a two-stage process:

 1. materialize target implicit knowledge, given only the source implicit knowledge and
    the source-to-target mapping; and
 2. materialize target explicit knowledge w.r.t. the materialized implicit knowledge,
    which should be as small as possible.
    M. Arenas was supported by Fondecyt grant #1110287, J. Pérez was supported by Fondecyt
    grant #11110404, the research of E. Sallinger is supported by the Austrian Science Fund
    (FWF):P25207-N23.
Here, we focus on the former, and thus say that such target implicit knowledge “rep-
resents” the source implicit knowledge under a given mapping. Two such notions of
“representation” were introduced recently: safety [5] and Q-representability [3, 4], where
Q is a query language. In this work, we introduce the very broad notion of general
representability that captures safety as a special case and lays the foundation to the study
of representability in a broad setting, in particular with the goal of extending our work to
Q-representability. To achieve this, general representability is not limited to knowledge
exchange, but based on arbitrary schema mappings. As an essential tool for this study, we
develop conditions on schema mapping languages under which notions of equivalence
important to our investigation coincide.
Organization. In Section 2, we define the necessary preliminaries. We start introducing
general representability in Section 3. Then, in Section 4, we show how to capture safety.
In Section 5, we introduce notions of equivalence towards capturing Q-representability.
We conclude in Section 6.

2      Preliminaries
We assume some familiarity with database theory and data exchange (cf., e.g., [1, 6]). A
schema S is a set of relation symbols, each having a fixed arity. An instance I ∈ Inst(S)
assigns to each relation symbol R a relation RI of the corresponding arity. We denote
by dom(I) the set of all constants that occur in a relation of I. We assume instances to
be finite and, thus, dom(I) is finite.
    A schema mapping M (or just mapping) from a source schema S1 to a target
schema S2 is a subset of Inst(S1 ) × Inst(S2 ).4 The composition of two mappings,
symbolized by ◦, is defined as the composition of binary relations [9]. M and M0
are equivalent, written M ≡ M0 , if they are equal as binary relations (M = M0 ).
Besides, we write M  M0 for M ⊆ M0 . We later use the notion of a mapping
M “based on” some logical formalism. By that we mean that there exists a set Σ of
dependencies from that formalism s.t. M = {(I, J) | (I, J)  Σ}. As these specific
formalisms are not the focus of this paper, we refer to e.g. [5] for formal details and
definitions. We say that Σ is over a schema S if all relation symbols occurring in Σ
are from S. For a set of instances I ⊆ Inst(S), the set of solutions of I under M is
SolM (I) = {J | (I, J) ∈ M and I ∈ I}. A single instance I ∈ Inst(S1 ) is treated as
a singleton set {I} wherever applicable, e.g. SolM (I) = SolM ({I}).
    In this paper we deal with knowledge bases in a very general way treating them
as mappings over a single schema. A general knowledge base over a schema S is
represented simply as a mapping M from S to S such that, intuitively, if I represents
the explicit knowledge, then SolM (I) represents all possible completions of I where the
implicit knowledge has been made explicit. For keeping schema specifications brief, a
subscripted mapping Mi is always from schema Si to Si , and a mapping Mij is always
from schema Si to Sj .


 4
     To adapt to the knowledge exchange setting [5], we do not explicitly consider labeled nulls
     in target instances, however the concept is preserved as constants only occurring in a target
     instance are essentially treated as labeled nulls in data exchange [6].


                                                 2
3    General Representability
Intuitively, given mappings M1 , M2 and M12 , we say that M2 “represents” M1 under
M12 if the following equivalence is fulfilled:
                                M1 ◦ M12 ≡ M12 ◦ M2 .                                      (†)
In the context of knowledge exchange, M1 and M2 can be seen as the mappings
responsible for generating the completions discussed in the preliminaries, and M12 as
the mapping used for transferring data. The condition (†) is very strong. In particular,
M12 must already transfer enough of the source data for such an M2 to exist. A very
critical point is that we require “logical” equivalence, i.e, for every source instance all the
solutions must coincide. It can be demonstrated in simple examples that this requirement
is too strong to be widely applicable.
     Less strict notions of representability have been studied in recent research, namely
safety [5] in the context of full tuple-generating dependencies [8], and Q-representability
– where Q stands for a class of queries – in the context of description logics [3, 4]. We
therefore introduce general representability, a very weak notion of representability which
allows us to study relaxed notions of representability in a very general setting.
Definition 1. Let ≡χ be an equivalence relation on schema mappings. If there exist
mappings M?1 and M?12 such that
                           M1 ◦ M12 ≡χ M?1 ◦ M?12 ◦ M2 ,                                   (‡)
then M2 is called a general χ-representation of M1 under M12 witnessed by the pair
of mappings (M?1 , M?12 ).
The intuition behind this definition is the following: M2 may not be the “ideal” repre-
sentation as defined by equation (†). Though, if through keeping small parts of M1 and
by slightly extending M12 we can show equivalence, this may be good enough for a
given application. In particular, it might not be possible to completely remove M1 , but
we may have to keep a “part” of M1 , namely M?1 . Second, it might not be possible to
use M12 , but we may need a slightly “extended” mapping M?12 . Of course, to study
representability in a general setting, we do not formally constrain in this definition how
these “parts” and “extensions” must look like. The third way we may adapt our equation
to some application is by using an appropriate equivalence relation ≡χ , e.g., by using the
so-called data exchange equivalence [7] if our application is related to data exchange.
    Given this general definition, one has a lot of freedom in choosing M?1 and M?12 .
In particular, by choosing M?1 = M1 and M?12 = M12 , even the identity mapping
M2 = Id2 = {(I, I) | I ∈ Inst(S2 )} is a general representation of M1 under M12 . On
the other end of the spectrum, there is a clear “best” kind of general representation M2 ,
namely if one can choose M?1 = Id1 = {(I, I) | I ∈ Inst(S1 )} and M?12 = M12 . If
these two equalities hold, we speak of a strong χ-representation. We thus described what
one could consider “trivial” and “best” representations, however there is much room
between those two extremes.
    For example, it is reasonable to require that while M?1 ◦ M?12 should be able to be
more restrictive than M12 , it should not be more restrictive than M1 ◦ M12 together.
Altogether, one can thus require that:

                                              3
                 M1 ◦ M12  M?1 ◦ M?12             and   M?1 ◦ M?12  M12 .
When these conditions hold, we talk about a weak χ-representation. We intend to use
weak χ-representations for characterizing Q-representability in future work.
    The intuition behind the notion of weak representability is to offer an intermediate
amount of freedom between strong representations and unconstrained general representa-
tions. In particular, the witness (M?1 , M?12 ) has to be at least as restrictive as the witness
of a strong representation (M?1 ◦ M?12  M12 ) and at most as restrictive as the witness
of a “trivial” representation (M1 ◦ M12  M?1 ◦ M?12 ). These conditions allow to rule
out problematic behaviour that may occur in a general setting, but not in the context of
e.g. Q-representability.
    An orthogonal restriction is to allow either only a source-to-target witness M?12 or
only a source witness M?1 in Definition 1, giving rise to the notions of source-to-target
χ-representation (M?1 = Id1 ) and source χ-representation (M?12 = M12 ). Formally,
M2 is a source-to-target χ-representation of M1 under M12 if M2 is a general χ-
representation witnessed by the pair of mappings (Id1 , M?12 ). Source χ-representation
can be defined similarly by requiring (M?1 , M12 ) as the witness in Definition 1. Note
that while for some situations M?1 or M?12 suffice, the interplay between them can
become important, in particular when putting restrictions on the classes of mappings
used. In the next section we use source-to-target χ-representations to characterize safety.

4    Characterizing Safety
In this section, we show that the notion of safety, which was developed in [5] in the
context of knowledge exchange, can be characterized by the notion of general represen-
tation introduced in the previous section. In what follows, we assume that a knowledge
base K = (I, Σ) over a schema S is given by an instance I of S, called the explicit
knowledge, and a set Σ of dependencies over S, called the implicit knowledge. Moreover,
we associate to Σ a mapping Map(Σ) = {(I, J) | I, J ∈ Inst(S), I ⊆ J and J  Σ}.
Finally, we define the set of models of K = (I, Σ) as Mod(K) = SolMap(Σ) (I).
     In order to define the notion of safety proposed in [5], we need to introduce some
terminology. Given a set I of instances of a schema S, we define Min(I) as the set
of minimal instances in I under the subset relation, that is, as {I ∈ I | there is
no I 0 ∈ I such that I 0 ( I}. Moreover, given a mapping M from a schema S1
to a schema S2 and knowledge bases K1 = (I1 , Σ1 ), K2 = (I2 , Σ2 ) over S1 and
S2 , respectively, K2 is called a minimal knowledge-base solution for K1 under M if
Min(Mod(K2 )) = Min(SolM (Mod(K1 ))) [5]. With these concepts, we can formally
define the notion of safety.

Definition 2 ([5]). Let M be a schema mapping from a schema S1 to a schema S2 , and
Σ1 , Σ2 be sets of dependencies over S1 and S2 , respectively. Then Σ2 is said to be safe
for Σ1 under M, if for every instance I1 of S1 , there exists an instance I2 of S2 such
that (I2 , Σ2 ) is a minimal knowledge-base solution for (I1 , Σ1 ) under M.

In order to capture safety by using the notion of general representability, we first need
to choose the appropriate notion of equivalence to be used in equation (‡). It is clear
that the notion of minimality plays a major role in Definition 2, thus equivalence with


                                               4
respect to minimal solutions is a natural choice. More precisely, for I ∈ Inst(S1 ), we
define MinSolM (I) = Min(SolM (I)) and say that M12 ≡MinSol M012 if for every
I ∈ Inst(S1 ), it holds that MinSolM12 (I) = MinSolM012 (I).
    It is important to notice that the original definition of safety given in [5] only
considered full tuple-generating dependencies (that is, Σ1 and M in Definition 2 were
required to be based on these dependencies). A consequence of this restriction is that
only mappings that are absolutely consistent, in the sense defined in [2], were considered
in the original definition of safety. So when capturing this notion we focus on this class
of mappings. More precisely, a mapping M from a schema S1 to a schema S2 is said to
be absolutely consistent if for every I ∈ Inst(S1 ), it holds that SolM (I) 6= ∅. Then we
have that:

Proposition 1. Let Σ1 , Σ2 be sets of dependencies over schemas S1 and S2 , respec-
tively, and assume that M12 is a mapping such that Map(Σ1 ) ◦ M12 is absolutely
consistent. Then Σ2 is safe for Σ1 under M12 iff Map(Σ2 ) is a source-to-target MinSol-
representation of Map(Σ1 ) under M12 .

It follows that safety as defined in [5] can be captured by source-to-target MinSol-
representability, as M1 ◦ M12 is absolutely consistent in the definition given in [5]
(since M1 and M12 are required to be based on full tuple-generating dependencies in
[5]). Interestingly, the previous result also holds for mappings based on tuple-generating
dependencies with terminating chase.

5    On the Equivalence of Mappings
To capture notions like safety or Q-representability within the framework of general
representability, one of the key ingredients is choosing the right notion of equivalence
of mappings. In this section, we present the first steps towards the development of
some appropriate notions of equivalence of mappings, which connect minimal solutions
(used to characterize safety in Proposition 1) to notions of equivalence relevant for
Q-representability.
     A key role in this connection is played by a generalization of the widely used notion
of universal solution [6], which is defined next. Let I be a set of instances, I1 , I2 ∈ I,
and let C be a set of constants. A homomorphism from I1 to I2 that is constant on C is
a function h : dom(I1 ) → dom(I2 ) such that for all c ∈ (C ∩ dom(I1 )) we have that
h(c) = c, and for all (x1 , . . . , xn ) ∈ RI1 we have that (h(x1 ), . . . , h(xn )) ∈ RI2 . We
write I1 C I2 if such an h exists. We say that I1 ≺C I2 if I1 C I2 and I2 6C I1 . Then
let HomC (I) = {I ∈ I | there is no I 0 ∈ I such that I 0 ≺C I}. Universal solutions
are defined as the minimum elements of a set I of solutions up to equivalence under
homomorphisms, while the elements of HomC (I) are defined as the minimal elements of
I up to equivalence under homomorphisms. Furthermore, given a pair I1 , I2 of instances
of a schema S and a set C of constants, we say that I1 is a C-core of I2 if [10, 8]: (1) I1
is a sub-instance of I2 , (2) I2 C I1 , and (3) there is no homomorphism from I2 to a
proper sub-instance of I1 that is the identity on C. Then let CoreC (I) = {I ∈ Inst(S) | I
is a C-core of some I 0 ∈ I}.
     We are now ready to define the notions of equivalence of mappings studied in
this section. For I ∈ Inst(S1 ) and mapping M from S1 , we define HomSolM (I) =


                                              5
Homdom(I) (SolM (I)) and CoreHomSolM (I) = Coredom(I) (HomSolM (I))). More-
over, M12 ≡CoreHomSol M012 if for all I ∈ Inst(S1 ) we have that CoreHomSolM12 (I) =
CoreHomSolM012 (I). Next we introduce conditions on mappings that allow us to relate
the newly introduced notion to the notion of MinSol we used before.

Definition 3. Let M be a mapping from a schema S1 to a schema S2 . Then: M
is domain-full if for all I ∈ Inst(S1 ) and J ∈ HomSolM (I), there exists J 0 ∈
HomSolM (I) such that J 0 ⊆ J and dom(J 0 ) ⊆ dom(I); M is founded if for all I ∈
Inst(S1 ) and J ∈ SolM (I), there exists J 0 ∈ HomSolM (I) such that J 0 dom(I) J; and
M admits core-solutions if for all I ∈ Inst(S1 ) we have CoreHomSolM (I) ⊆ SolM (I).

Proposition 2. Let M and M0 be mappings from a schema S1 to a schema S2 such that
M is domain-full, founded, and admits core-solutions. Then it holds that M ≡MinSol M0
iff M ≡CoreHomSol M0 .

So for safety as originally defined in [5], we thus know that the following holds:

Corollary 1. Let Σ1 be a set of full tgds over a schema S1 , Σ2 a set of dependencies
over a schema S2 and M12 a mapping from S1 to S2 based on full s-t tgds. Then Σ2 is
safe for Σ1 under M12 iff Map(Σ2 ) is a source-to-target CoreHomSol-representation
of Map(Σ1 ) under M12 .

We have not yet developed a similar result for the notion of Q-representability. But
towards that, we studied the connection between equivalence based on CoreHomSol and
notions based on query languages relevant for Q-representability, that we introduce next.
We assume that a k-ary query q over a schema S is a function that maps every instance I
of S to a k-ary relation q(I) contained in dom(I)k . Moreover, given a set I of instances
of S and a k-ary query     T q over S, the set of certain answers of q over I, denoted by
certq (I), is defined as I∈I q(I). Let Q be a class of queries, and M, M0 be mappings
from a schema S1 to a schema S2 . In the spirit of [7], we say that M ≡Q M0 if for all
q ∈ Q over S2 and I ∈ Inst(S1 ), it holds certq (SolM (I)) = certq (SolM0 (I)). A k-ary
query q over a schema S is said to be closed under homomorphisms if for every pair of
instances I1 , I2 of S, tuple (a1 , . . . , ak ) ∈ q(I1 ) and homomorphism h from I1 to I2 , it
holds that (h(a1 ), . . . , h(ak )) ∈ q(I2 ). Let CHQ be the class of queries that are closed
under homomorphisms, and UCQ be the class of unions of conjunctive queries.
    Again, to formally state the relations between notions of equivalence of mappings,
we define some natural conditions and the necessary terminology. We use notation
I1 ∼
   =C I2 to indicate that there exists an isomorphism from I1 to I2 that is the identity
on C, and we use notation I1 ↔C I2 to indicate that I1 C I2 and I2 C I1 . Besides,
we use notation I1 ∼ = I2 as a shorthand for I1 ∼     =∅ I2 .
Definition 4. Let M be a mapping from a schema S1 to a schema S2 . Then: M is closed
under isomorphisms if for all I1 , I2 ∈ Inst(S1 ) and J1 , J2 ∈ Inst(S2 ), if J1 ∈ SolM (I1 )
and (I1 , J1 ) ∼
               = (I2 , J2 ), then J2 ∈ SolM (I2 ); M is target domain-independent if
for all I ∈ Inst(S1 ) and J1 , J2 ∈ HomSolM (I), it holds that dom(J1 ) ∩ dom(I) =
dom(J2 ) ∩ dom(I); and M has a finite base if for every I ∈ Inst(S1 ), it holds that
HomSolM (I) is finite up to ↔dom(I) .


                                              6
Theorem 1. Let M, M0 be mappings from a schema S1 to a schema S2 , such that
M, M0 are closed under isomorphisms, founded and target domain-independent. Then
M ≡CHQ M0 iff M ≡CoreHomSol M0

As a corollary of the proof of Theorem 1, we have the following for ≡UCQ , the most
studied notion of equivalence in the context of Q-representability:

Corollary 2. Let M, M0 be mappings from a schema S1 to a schema S2 , such that M,
M0 are closed under isomorphisms, are founded, are target domain-independent and
have a finite base. Then M ≡UCQ M0 iff M ≡CoreHomSol M0 .


6    Concluding remarks
In this paper, we introduced the framework of general representability that allows
us to study notions of representability like safety and Q-representability. We intro-
duced notions of equivalence suitable for capturing safety, and defined conditions under
which these characterizations apply. In particular, we were able to capture safety as
originally defined. We also started introducing the necessary machinery for studying
Q-representability, with a focus on the UCQ and the more general CHQ case.
    As future work, we intend to capture Q-representability, and continue using the
framework of general representability to develop notions of representation both for
specific applications like knowledge exchange as well as in the general case. The study
of the conditions and notions of equivalence we introduced will also be an interesting
endeavor in its own right, as it allows us to find candidates for conditions which schema
mapping formalisms should satisfy.


References
 1. Abiteboul, S., Hull, R., Vianu, V.: Foundations of Databases. Addison-Wesley (1995)
 2. Amano, S., Libkin, L., Murlak, F.: XML schema mappings. In: PODS. pp. 33–42 (2009)
 3. Arenas, M., Botoeva, E., Calvanese, D.: Knowledge base exchange. In: DL (2011)
 4. Arenas, M., Botoeva, E., Calvanese, D., Ryzhikov, V., Sherkhonov, E.: Exchanging description
    logic knowledge bases. In: KR (2012)
 5. Arenas, M., Pérez, J., Reutter, J.L.: Data exchange beyond complete data. In: PODS. pp.
    83–94 (2011)
 6. Fagin, R., Kolaitis, P.G., Miller, R.J., Popa, L.: Data exchange: semantics and query answering.
    Theor. Comput. Sci. 336(1), 89–124 (2005)
 7. Fagin, R., Kolaitis, P.G., Nash, A., Popa, L.: Towards a theory of schema-mapping optimiza-
    tion. In: PODS. pp. 33–42 (2008)
 8. Fagin, R., Kolaitis, P.G., Popa, L.: Data exchange: getting to the core. ACM Trans. Database
    Syst. 30(1), 174–210 (2005)
 9. Fagin, R., Kolaitis, P.G., Popa, L., Tan, W.C.: Composing schema mappings: Second-order
    dependencies to the rescue. ACM Trans. Database Syst. 30(4), 994–1055 (2005)
10. Hell, P., Nesetril, J.: The core of a graph. Discrete Mathematics 109(1–3), 117–126 (1992)
11. Hernich, A., Schweikardt, N.: Logic and data exchange: Which solutions are ”good” solutions?
    In: LOFT. pp. 61–85 (2008)


                                                 7