<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Open and Closed World Assumptions in Data Exchange</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Leonid Libkin</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Cristina Sirangelo</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>LSV, ENS-Cachan, CNRS and INRIA</institution>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>School of Informatics, University of Edinburgh</institution>
        </aff>
      </contrib-group>
      <abstract>
        <p>Data exchange is the problem of finding an instance of a target schema, given an instance of a source schema and a specification of a mapping between the source and the target schemas, and answering queries over target instances. A schema mapping is a condition, often expressed in a fragment of first-order logic, that relates instances of source and target schemas. Most commonly it is a conjunction of sentences of the form</p>
      </abstract>
      <kwd-group>
        <kwd>∀x¯ ϕ(x¯) → ∃z¯ ψ(x¯</kwd>
        <kwd>z¯)</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>-</title>
      <p>(1)
where ϕ and ψ are both conjunctions of atomic formulae. For example, if we
have a source database S(·, ·) that lists employees and their salaries, and a target
database T (·, ·) that lists people and their children, then a natural mapping is
∀x∀y (S(x, y) → ∃z T (x, z)).</p>
      <p>
        The study of both data exchange and schema mappings has been actively
pursued recently (see, e.g., recent surveys [
        <xref ref-type="bibr" rid="ref11 ref3 ref4">3, 11, 4</xref>
        ]). Existing implementations
have been incorporated into major database products.
      </p>
      <p>
        Theoretical foundations of data exchange were developed in [
        <xref ref-type="bibr" rid="ref5 ref6">5, 6</xref>
        ]. For a
source instance S and a schema mapping M , a target instance T is a
solution for S if S and T together satisfy the conditions of M . For instance, if we
have a mapping given by a conjunction of sentences of the form (1), it means
that for each tuple a¯ making ϕ(a¯) true in the source, one can find a tuple ¯b of
values in the target database so that the conjunction ψ(a¯, ¯b) is true in the target.
      </p>
      <p>
        This semantics implies that for the same source instance one can have many
different target solutions. In fact this semantics of data exchange follows the so
called open world assumption (OWA), as target solutions may contain arbitrary
facts, some of which are not directly implied by the source data and the schema
mapping. An alternative approach, developed in [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] adopts the so called closed
world assumption (CWA) [14], for defining the semantics of data exchange. Here
intuitively only facts which are a direct “translation” of source data are allowed
in the target.
      </p>
      <p>If a query Q is posed against target instances, then it must be answered in a
way that is consistent with the source. A common approach is to look for certain
answers:
certainM (Q, S) = \{Q(T ) | T is a solution}.
(2)
A desirable feature in data exchange is to find a particular solution to be
materialized in the target, which allows to answer queries over the target (with certain
answer semantics), with no access to source data. In other words, we want a
specific target instance T0 so that certainM (Q, S) = Q′(T0), for some query Q .
′</p>
      <p>
        Data exchange literature identified two particularly nice solutions, called the
canonical solution [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] and the core [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. It was shown in [
        <xref ref-type="bibr" rid="ref5 ref6">5, 6</xref>
        ] that, under the
OWA, certain answers (2) can be found in polynomial time if Q is a union of
conjunctive queries (i.e., in the (∃, ∧, ∨)-fragment of FO). In fact, in that case
it is easy to construct a query Q′ so that certainM (Q, S) = Q′(T0) if T0 is either
the canonical solution, or the core.
      </p>
      <p>While this looks reasonable, the definition of query answering is actually
underspecified. We know well how to answer queries on complete databases; in
contrast, databases arising in data exchange are often incomplete as they contain
null values (for example, it is common for target schemas to have attributes that
do not correspond to any attributes in the source schema; these are populated
with nulls). Thus, the notions of query answering over incomplete databases are
essential in data exchange, and have to be incorporated into the algorithms for
answering queries in data exchange.</p>
      <p>
        Let us briefly recall the standard approach to query answering over databases
with incomplete information [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. Such a database is formally a relational
structure over the disjoint union of two domains: D of data values, and V of variables
(or nulls, usually denoted by the symbol ⊥ with subscripts). For such a database
R, we let Rep(R) denote the set of all complete databases (over D) that it
represents. This can be interpreted again under two assumptions: under the closed
world assumption, Rep(R) consists of all databases obtained by applying
mappings V → D to R; under the open world assumption, Rep(R) is the set of all
databases that contain v(R), for some mapping v : V → D.
      </p>
      <p>
        The answers to a query Q over R are defined as
That is, 2Q(R) contains answers independent of the interpretation of nulls. One
easy case of finding such answers is known: if Q is a union of conjunctive queries,
then 2Q(R), under both closed and open world assumptions, is obtained by a
straightforward evaluation of Q on R and then discarding tuples that contain
nulls (elements of V ). This is referred to as naive evaluation. Clearly, naive
evaluation is done in PTIME; adding negation to queries (e.g., going to the full
FO) makes the problem coNP-complete under the closed world assumption [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]
and undecidable under the OWA.
      </p>
      <p>
        The theory of incomplete information explains why the OWA works well for
answering unions of conjunctive queries in data exchange. In fact the canonical
solution in data exchange, viewed as an open-world incomplete database, turns
out to represent the set of all data exchange solutions under the OWA (property
already referred to as universality in [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]). It follows that, if Tc is the canonical
solution, 2Q(Tc) computes data exchange certain answers certainM (Q, S) under
the OWA. If Q is a union of conjunctive query, all one needs to do is to compute
the canonical solution, and naively evaluate the query over it – then results of
[
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] mentioned above guarantee that the process computes certain answers.
      </p>
      <p>
        However, going beyond unions of conjunctive queries poses problems. They
appear even in the simplest mappings of the form ∀x¯ R(x¯) → R′(x¯) that copy
contents of each relation R into a new relation R′. Under the OWA, even FO
query answering in data exchange is undecidable, and for every Boolean query,
either its certain answers are false for all instances, or the certain answers for its
negation are false for all instances [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. There are other examples of anomalous
behavior of query answering in data exchange.
      </p>
      <p>The reason behind such anomalies is that our intuition tells us that the
mapping ∀x¯ R(x¯) → R′(x¯) forces R′ to be a copy of R; however, under the OWA
semantics, R′ is allowed to be an arbitrary extension of R. It seems natural then
′
to adopt the CWA that would not permit adding arbitrary facts to R and would
only put there what the constraints dictate – that is, a copy of R.</p>
      <p>
        This approach was developed in [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]. It introduced the notion of
CWAsolutions based on three assumptions. First, every fact put in a target instance
must be justified. Second, every justification for putting a tuple into the
target should not generate multiple nulls. And third, every fact true in the target
instance must be derivable from the source instance and constraints in the
mappings. When these are properly formalized, the CWA-solutions can be
characterized as homomorphic images of the canonical solution that have a homomorphism
back into the canonical solution. It follows that every CWA-solution contains a
copy of the core [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ].
      </p>
      <p>When the definition of certainM (Q, S) is applied to CWA-solutions, a proper
notion of query answers Q(T ) has to be adopted for CWA-solutions T . The most
natural semantics is 2Q(T ), where T is viewed as a closed-world incomplete
database. This approach naturally leads to a new semantics of query answering:
a tuple t is in certainM (Q, S) under the CWA if, for every CWA-solution T and
every R ∈ Rep(T ), this tuple is in Q(R). That is,</p>
      <p>
        certainCMWA(Q, S) = \{Q(R) | R ∈ Rep(T ), T is a CWA-solution.}
It was shown in [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] that the canonical solution is also a CWA-solution and,
viewed as a closed-world incomplete database, represents all databases that are
in Rep(T ) for some CWA-solution T . It follows that, as in the OWA case,
the certain answers can be computed over the canonical solution Tc: namely,
certainCMWA(Q, S) = 2Q(Tc). Hence, the problem of answering queries in data
exchange is reduced to the well-studied problems of answering queries in
closedworld databases with nulls. In particular, from [
        <xref ref-type="bibr" rid="ref1 ref10">1, 10</xref>
        ] one derives that
computing certainCMWA(Q, S) is coNP-complete for arbitrary FO queries and tractable
for unions of conjunctive queries (for which the results coincide under both the
OWA and the CWA).
      </p>
      <p>
        Extensions of this approach to mappings that involve constraints over target
schemas were developed in [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ].
      </p>
      <p>But fully open or fully closed mappings, being two extreme cases, are bound
to have their shortcomings. Consider again the example of a source database
S(·, ·) that lists employees and their salaries, and a target database T (·, ·) that
lists people and their children. The mapping is given by ∀x∀y (S(x, y) →
∃z T (x, z)). Under the CWA, if we assume that each employee has one salary,
in the target each employee will have exactly one value for the ‘child’ attribute.
This example indicates that it is natural to declare some target attributes as
open, and some as closed. In the above example, we would formulate the
mapping as ∀x∀y (S(x, y) → ∃z T (xcl, zop)), saying that only employees from the
source are moved to the target, but the number of values of the ‘child’ attribute
are arbitrary.</p>
      <p>
        Such mixed mappings were studied in [13]. Data exchange solutions under
mixed mappings are incomplete databases where domain elements are annotated
as either open (op) or closed (cl ). Such an approach for databases with nulls was
used back in the 80s, see [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. If T is such an annotated instance, RepA(T ) is
defined so as to apply the OWA on values annotated as open, and the CWA on
values annotated as closed. In particular, to get an instance in RepA(T ), after
applying a valuation v to T , any tuple in v(T ) can be replicated arbitrarily many
times by changing the values of attributes annotated as open, but keeping the
same value of attributes annotated as closed. For example, RepA({(acl, ⊥op)})
contains all relations R whose projection on the first attribute is {a}. Similarly
RepA({(acl, ⊥1op, ⊥c2l)}) contains all relations whose first projection is {a} and
whose third projection is a singleton. On the contrary, RepA({(acl, ⊥cl)})
contains all one-tuple relations {(a, b)} for some constant b. If all annotations in T
are open (closed, resp), clearly RepA(T ) coincides with Rep(T ) under the OWA
(CWA, resp.).
      </p>
      <p>Annotated data exchange solutions have been defined along the lines of
CWAsolutions. Each tuple put in the target still needs to be justified by some source
data and the schema mapping. The tuple annotation (determined by the schema
mapping) will then give either and open-world or closed-world semantics to the
tuple attributes. The restriction that all facts true in the target must be
implied by the source and the schema mapping is enforced only limited to closed
attributes.</p>
      <p>
        A particular annotated solution, the annotated canonical solution, has the
usual property that it represents the semantics of all the solutions. That is, if Tc
is the annotated canonical solution, RepA(Tc) contains precisely all the instances
R that are in RepA(T ) for some annotated solution T . When annotation is
removed from Tc, this is the usual canonical solution of [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] and [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ], obtained by
disregarding annotation in the schema mapping. This also shows, as proved in
[13], that if a mapping has all open (resp, all closed) annotation, the semantics
of annotated solutions coincides with the semantics of the OWA solutions of [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]
(respectively CWA-solutions of [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]).
      </p>
      <p>The semantics of query answering is standard: for an annotated schema
mapping Mα a tuple t is certainMα (Q, S) if, for every annotated solution T and
every R ∈ RepA(T ), this tuple is in Q(R). Since the semantics of the annotated
canonical solution Tc captures the semantics of all solutions, the usual equation
certainMα (Q, S) = 2Q(Tc) holds also under annotated mappings. Thus the
complexity of computing certain answers for arbitrary FO queries, must range from
coNP-complete (in the case of all ‘closed’ annotation) to undecidable (in the case
of all ‘open’ annotation).</p>
      <p>In particular it was shown that for query answering, there is a complexity
trichotomy that depends on the number of target attributes declared as open
per each atom in a rule in the mapping. If the number is 0 (i.e., we are under
the CWA), then the complexity is in general coNP-complete. If the number is 2
or greater, then, as for the OWA, the problem is undecidable. But if the number
is 1, the problem is decidable, albeit with a high complexity:
coNEXPTIMEcomplete. Note that this means that each atom in a rule has at most one open
annotation; the rule itself (or the mapping) can have arbitrarily many open
annotations. We prohibit atoms such as T (xop, yop) but can have, for example,
T1(xop, ycl) ∧ T1(yop, zcl) ∧ T2(xop, zcl).</p>
      <p>In the case of unions of conjunctive queries, all the cases collapse and the
usual naive evaluation over the canonical solution produces the answer in
polynomial time.</p>
      <p>
        Finally, [13] considered the issue of composing schema mappings, and showed
that the OWA approach of [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] extends to CWA mappings, but does not guarantee
composition in general for mixed mappings.
      </p>
      <p>Acknowledgments Supported by EPSRC grants E005039 and F028288, and the
FET-Open Project FoX (grant agreement 233599).
13. L. Libkin and C. Sirangelo. Data exchange and schema mappings in open and
closed worlds In PODS’08, pages 139–148.
14. R. Reiter. On closed world databases. In Logic and Databases, Plenum Press, 1978,
pages 55–76.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <given-names>S.</given-names>
            <surname>Abiteboul</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Kanellakis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Grahne</surname>
          </string-name>
          .
          <article-title>On the representation and querying of sets of possible worlds</article-title>
          .
          <source>TCS</source>
          <volume>78</volume>
          (
          <year>1991</year>
          ),
          <fpage>158</fpage>
          -
          <lpage>187</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <given-names>M.</given-names>
            <surname>Arenas</surname>
          </string-name>
          , P. Barcel´o,
          <string-name>
            <given-names>R.</given-names>
            <surname>Fagin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Libkin</surname>
          </string-name>
          .
          <article-title>Locally consistent transformations and query answering in data exchange</article-title>
          .
          <source>In PODS 2004</source>
          , pages
          <fpage>229</fpage>
          -
          <lpage>240</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <given-names>P.</given-names>
            <surname>Barcel</surname>
          </string-name>
          <article-title>´o. Logical foundations of relational data exchange</article-title>
          .
          <source>SIGMOD Record</source>
          <volume>38</volume>
          (
          <year>2009</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <given-names>P.</given-names>
            <surname>Bernstein</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Melnik</surname>
          </string-name>
          .
          <source>Model management 2</source>
          .
          <article-title>0: manipulating richer mappings</article-title>
          .
          <source>SIGMOD'07</source>
          , pages
          <fpage>1</fpage>
          -
          <lpage>12</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <given-names>R.</given-names>
            <surname>Fagin</surname>
          </string-name>
          , Ph. Kolaitis,
          <string-name>
            <given-names>R.</given-names>
            <surname>Miller</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Popa</surname>
          </string-name>
          .
          <article-title>Data exchange: semantics and query answering</article-title>
          .
          <source>Theor. Comput. Sci</source>
          .
          <volume>336</volume>
          (
          <issue>1</issue>
          ):
          <fpage>89</fpage>
          -
          <lpage>124</lpage>
          (
          <year>2005</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <given-names>R.</given-names>
            <surname>Fagin</surname>
          </string-name>
          , Ph. Kolaitis,
          <string-name>
            <given-names>L.</given-names>
            <surname>Popa</surname>
          </string-name>
          .
          <article-title>Data exchange: getting to the core</article-title>
          .
          <source>ACM TODS 30</source>
          (
          <issue>1</issue>
          ):
          <fpage>174</fpage>
          -
          <lpage>210</lpage>
          (
          <year>2005</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <given-names>R.</given-names>
            <surname>Fagin</surname>
          </string-name>
          , Ph. Kolaitis,
          <string-name>
            <given-names>L.</given-names>
            <surname>Popa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.C.</given-names>
            <surname>Tan</surname>
          </string-name>
          .
          <article-title>Composing schema mappings: secondorder dependencies to the rescue</article-title>
          .
          <source>ACM TODS 30</source>
          (
          <issue>4</issue>
          )
          <fpage>994</fpage>
          -
          <lpage>1055</lpage>
          (
          <year>2005</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <given-names>G.</given-names>
            <surname>Gottlob</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Zicari</surname>
          </string-name>
          .
          <article-title>Closed world databases opened through null values</article-title>
          .
          <source>In VLDB'88</source>
          , pages
          <fpage>50</fpage>
          -
          <lpage>61</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <given-names>A.</given-names>
            <surname>Hernich</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Schweikardt</surname>
          </string-name>
          .
          <article-title>CWA-solutions for data exchange settings with target dependencies</article-title>
          .
          <source>In PODS'07</source>
          , pages
          <fpage>113</fpage>
          -
          <lpage>122</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <given-names>T.</given-names>
            <surname>Imielinski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Lipski</surname>
          </string-name>
          .
          <article-title>Incomplete information in relational databases</article-title>
          .
          <source>J. ACM</source>
          <volume>31</volume>
          (
          <year>1984</year>
          ),
          <fpage>761</fpage>
          -
          <lpage>791</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Ph</surname>
          </string-name>
          . Kolaitis.
          <article-title>Schema mappings, data exchange, and metadata management</article-title>
          .
          <source>In PODS</source>
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <given-names>L.</given-names>
            <surname>Libkin</surname>
          </string-name>
          .
          <article-title>Data exchange and incomplete information</article-title>
          .
          <source>In PODS'06</source>
          , pages
          <fpage>60</fpage>
          -
          <lpage>69</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>