<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Preliminary results on Ontology-based Open Data Publishing</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Gianluca Cima</string-name>
          <email>cima@diag.uniroma1.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Dipartimento di Ingegneria Informatica, Automatica e Gestionale Sapienza Universita` di Roma</institution>
        </aff>
      </contrib-group>
      <abstract>
<p>Despite the current interest in Open Data publishing, a formal and comprehensive methodology supporting an organization in deciding which data to publish and in carrying out precise procedures for publishing high-quality data is still missing. In this paper we argue that the Ontology-based Data Management paradigm can provide a formal basis for a principled approach to publishing high-quality, semantically annotated Open Data. We describe two main approaches to using an ontology for this endeavor, and then we present some technical results on one of the approaches, called bottom-up, where the specification of the data to be published is given in terms of the sources, and specific techniques allow deriving suitable annotations for interpreting the published data in the light of the ontology.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        In many aspects of our society there is growing awareness of, and consensus on, the need
for data-driven approaches that are resilient, transparent and fully accountable. But to
achieve a data-driven society, it is necessary that the data needed for public goods be
readily available. Thus, it is not surprising that in recent years both public and private
organizations have been faced with the issue of publishing Open Data, in particular with
the goal of providing data consumers with suitable information for capturing the
semantics of the data they publish. Significant efforts have been devoted to defining
guidelines concerning the management and publication of Open Data. Notably, the W3C
has formed a working group whose objective is the release of a first draft on Open
Data Standards. The document focuses on areas such as metadata, data formats,
data licenses, data quality, etc., which are treated in very general terms, with no
reference to any specific technical methodology. More generally, although there are several
works on platforms and architectures for publishing Open Data, there is still no formal
and comprehensive methodology supporting an organization in (i) deciding which data
to publish, and (ii) carrying out precise procedures for publishing and documenting
high-quality data. One of the reasons for this lack of formal methods is that the
problem of Open Data Publishing is strictly related to the problem of managing the data
within an organization. Indeed, a necessary prerequisite for an organization to
publish relevant and meaningful data is to be able to manage, maintain and document
its own information system. The recent paradigm of Ontology-based Data Management
(OBDM) [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ] (used and tested in practice in recent years; see, e.g., [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]) is an
attempt to provide the principles and techniques for addressing this challenge. An
OBDM system consists of an ontology, the data sources forming the information
system, and the mapping between the ontology and the sources. The ontology is a
formal representation of the domain underlying the information system, and the mapping
is a precise specification of the relationship between the data at the sources and the
concepts in the ontology.
      </p>
      <p>In this paper we argue that the OBDM paradigm can provide a formal basis for
a principled approach to publish high-quality, semantically annotated Open Data. The
most basic task in Open Data publishing is the extraction of the correct content for the dataset(s) to
be published, where by “content” we mean both the extensional information (i.e., facts
about the domain of interest) conveyed by the dataset, and the intensional knowledge
relevant to document such facts (e.g., concepts that intensionally describe facts), and
“correct” means that the aspect of the domain captured by the dataset is coherent with
a requirement formally expressed in the organization.</p>
      <p>Current practices for publishing Open Data focus essentially on providing
extensional information (often in very simple forms, such as CSV files), and they carry out
the task of documenting data mostly by using metadata expressed in natural languages,
or in terms of record structures. As a consequence, the semantics of datasets is not
formally expressed in a machine-readable form. Conversely, OBDM opens up the
possibility of a new way of publishing data, with the idea of annotating data items with the
ontology elements that describe them in terms of the concepts in the domain of the
organization. When an OBDM system is available in an organization, an obvious way to proceed with
Open Data publication is as follows: (i) express the dataset to be published in terms of
a SPARQL query over the ontology, (ii) compute the certain answers to the query, and
(iii) publish the result of the certain answer computation, using the query expression
and the ontology as a basis for annotating the dataset with suitable metadata expressing
its semantics. We call such method top-down. Using this method, the ontology is the
heart of the task: it is used for expressing the content of the dataset to be published (in
terms of a query), and it is used, together with the query, for annotating the published
data.</p>
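      <p>The three steps of the top-down method can be sketched in code. The following toy is purely illustrative (all relation names, the dictionary encodings, and the closure-based computation of certain answers are assumptions of this sketch, not the paper's machinery): the query over the ontology is a single atomic concept rather than a SPARQL query, and the ontology contains only concept inclusions, for which closing the mapped data under sub-concepts yields the certain answers.</p>

```python
# Toy illustration of steps (i)-(iii); all names and the closure-based
# computation of certain answers are assumptions of this sketch.

# Hypothetical OBDM specification: ontology O as concept inclusions
# (sub, sup), mapping M from source relations to ontology concepts.
O = {("Employee", "Person"), ("Manager", "Employee")}
M = {"emp_table": "Employee", "mgr_table": "Manager"}

# Source database C: relation name to set of constants.
C = {"emp_table": {"alice", "bob"}, "mgr_table": {"carla"}}

def abox(M, C):
    """Populate ontology concepts from the sources via the mapping."""
    A = {}
    for rel, concept in M.items():
        A.setdefault(concept, set()).update(C[rel])
    return A

def certain_answers(concept, O, A):
    """Certain answers to the atomic query {x | concept(x)} in this
    inclusion-only fragment: every individual asserted in a (direct or
    indirect) sub-concept of `concept`."""
    subs = {concept}
    changed = True
    while changed:
        changed = False
        for (sub, sup) in O:
            if sup in subs and sub not in subs:
                subs.add(sub)
                changed = True
    return set().union(*(A.get(c, set()) for c in subs))

# Step (i): the dataset to publish, expressed over the ontology.
query_concept = "Person"
# Step (ii): compute the certain answers.
answers = certain_answers(query_concept, O, abox(M, C))
# Step (iii): publish the result annotated with its ontology-level meaning.
dataset = {"annotation": f"instances of {query_concept}",
           "data": sorted(answers)}   # ['alice', 'bob', 'carla']
```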
      <p>
        Unfortunately, in many organizations (for example, in Public Administrations) it
may be the case that people are not ready yet to manage their information systems
through the OBDM paradigm. In these cases, the bottom-up approach could be more
appropriate. For example, in the Italian Public Administration system, it is very unlikely
that local administration people are able to express their queries over the ontology using
SPARQL. Typically, the ontology and the mapping have been designed by third parties,
with little or no involvement of the IT people responsible for the local administration
information system. In other words, these people probably cannot follow the top-down
approach, and they are more comfortable expressing the specification of the dataset to be
published directly in terms of the source structures (i.e., the relational tables in their
databases), or, more generally, in terms of a view over the sources. But how can we
automatically publish both the content and the semantics of the dataset if its specification
is given in terms of the data sources? We argue that we can achieve this goal by
following what we call the bottom-up approach: the organization expresses its publishing
requirement as a query over the sources, and, by using the ontology and the mapping,
a suitable algorithm computes the corresponding query over the ontology. With such
query at hand, we have reduced the problem in such a way that the top-down approach
can now be followed, and the required data can be published according to the method
described above. So, at the heart of the bottom-up approach there is a conceptual issue
to address:
“Given a query Q over the sources, which is the query over the ontology that
best characterizes Q (independently of the current source database)?”
Note that the answer to this question is relevant also for other tasks related to the
management of the information system, e.g., the task of explaining the semantics of the
various data sources within the organization. The question implicitly refers to a sort
of reverse engineering problem, which is a novel aspect in the investigation of both
OBDM and data integration. Indeed, most (if not all) of the literature on
managing data sources through an ontology (see, e.g., [
        <xref ref-type="bibr" rid="ref18 ref5">5, 18</xref>
        ]), or more generally, about data
integration [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ] assumes that the user query is expressed over the global schema, and that
the goal is to find a rewriting (i.e., a query over the source schema) that captures the
original query in the best way, independently of the current source database. Here,
the problem is reversed: we start with a source query and aim at deriving a
corresponding query over the ontology, called a source-to-target rewriting.
      </p>
      <p>In this paper we study the bottom-up approach described above, and provide the
following contributions.</p>
      <p>– We introduce the concept of source-to-target rewriting (see Section 3), the main
technical notion underlying the bottom-up approach, and we describe two
computational problems related to it, namely the recognition problem and the finding
problem. The former asks to check whether a query over the ontology is a
source-to-target rewriting of a given query over the sources, taking into account
the mapping between the sources and the ontology. The latter asks to compute
a suitable source-to-target rewriting of a given source query with respect to the
mapping.
– We discuss two different semantics for source-to-target rewritings, one based on
the logical models of the OBDM specification, and one based on certain answers.
The former is in some sense the natural choice, given the first-order semantics behind
OBDM. The latter is a significant alternative that may better capture the intuition
of a user who is accustomed to thinking of query semantics in terms of certain answers.
– We show that, although the ideal notion is that of “exact” source-to-target
rewriting, it is important to resort to approximations of the exact rewriting when exactness
cannot be achieved. For this reason, we introduce the notions of sound and complete
source-to-target rewritings.
– For the case of complete source-to-target rewritings, we present algorithms both for
the recognition problem (Section 4) and for the finding problem (Section 5), in particular for
the setting where the ontology is expressed in DL-Lite_{A,id} and the queries involved
in the specification are conjunctive queries.</p>
    </sec>
    <sec id="sec-2">
      <title>Preliminaries</title>
      <p>
        We assume familiarity with classical databases [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], Description Logics [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], and the
OBDM paradigm. In this section, we (i) review the basic notions of non-ground
instances and their correspondence with conjunctive queries; (ii) briefly discuss the chase
of a possibly non-ground instance; (iii) fix the notation regarding the OBDM paradigm
that we use in the following.
      </p>
      <p>
        For a possibly non-ground instance D, we assume that each value in dom(D), i.e.,
the set of values occurring in D, comes from the union of two fixed disjoint infinite
sets: the set Const of all constants, and the set Null_D of all labeled nulls. We also let
null(D) := dom(D) ∩ Null_D. In particular, each labeled null in a non-ground instance
is treated as an unknown value (and hence as incomplete information), rather than as
a non-existent value [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ]. Thus, a non-ground instance represents a number of ground
instances obtained by assigning constants to the labeled nulls. More precisely, let D be
a non-ground instance, and let v : null(D) → Const be a mapping. Then v is called
a valuation of D, and we denote by v(D) the ground instance obtained from D
by replacing each labeled null x ∈ null(D) with v(x) everywhere. We extend this to
tuples: given a tuple u = (u_1, ..., u_n) of constants and labeled nulls, we denote by
v(u) the tuple (v′(u_1), ..., v′(u_n)), where v′(u_i) = u_i if u_i is a constant;
otherwise (u_i is a labeled null), v′(u_i) = v(u_i). Given an instance D, it is possible to
construct in linear time a boolean CQ q_D that fully captures it, and vice versa. We also
let q_D(x) denote the query obtained from q_D by removing the existential quantification
of the variables x in q_D. Moreover, given a non-boolean CQ q (with x as distinguished
variables), we associate with it the instance D_q obtained by treating the variables in x as if they
were existentially quantified. For ease of presentation, we extend CQs to allow also
queries of the form {x | ⊥(x)} and {x | ⊤(x)}, with their usual meaning. We also
denote by tup(q) the tuple of terms in the head of q.
      </p>
      <p>Given a source schema S, a target schema T, a set M of st-tgds (i.e., assertions of
the form ∀x,y (φ(x,y) → ∃z ψ(x,z)), where φ is a CQ over S, and ψ is a CQ over
T), and a set Σ_t of egds (i.e., assertions of the form ∀x (φ_T(x) → x_1 = x_2), where
φ_T is a CQ over T, and x_1, x_2 are among the variables in x), the chase procedure of a
possibly non-ground source instance D consists in: (i) the chase of D w.r.t. M, where,
for every st-tgd φ(x,y) → ∃z ψ(x,z) in M and for every pair of tuples (a,b) such that
D ⊨ φ(a,b), new facts are introduced in the instance J of the target schema
T so that ψ(a,u) holds, where u is a fresh tuple of distinct labeled nulls coming
from an infinite set Null disjoint from Null_D; (ii) the chase of J w.r.t. Σ_t, where, for
every egd ∀x (φ_T(x) → x_1 = x_2) and for every tuple a such that J ⊨ φ_T(a) and
a_1 ≠ a_2, we equate the two terms. Equating a_1 with a_2 means choosing one of the two,
so that the other is replaced everywhere in J by the one chosen. In particular, if one is a
labeled null and the other is a constant, then the chase chooses the constant; if both are
labeled nulls, one coming from Null_D and the other from Null, it always chooses the
one coming from Null_D; if both are constants, then the chase fails. Moreover, we denote
by σ the set of equalities applied by the chase of J w.r.t. a set of egds on variables
coming from Null_D; this can be maintained by keeping track of the substitutions applied by
the chase. For example, if the chase equates the variable y ∈ Null_D with the variable
x ∈ Null_D, then equates the variable z ∈ Null_D with the variable w ∈ Null_D, and
then equates w with the constant a, then, given the tuple (x, y, z, w), σ(x, y, z, w) denotes the tuple
(x, x, a, a). Note that we can compute the certain answers of a boolean union of CQs
(UCQ) q with at most one inequality per disjunct by splitting q into a boolean UCQ q^{1≠}
with exactly one inequality per disjunct and a boolean UCQ q^{0≠} with no inequalities per
disjunct. The key idea is that the negation of q^{1≠} is a set of egds; hence, the
certain answers of q can be computed by applying the chase procedure w.r.t. ¬q^{1≠} over the instance
J (i.e., the instance produced by the chase of C w.r.t. M and Σ_t), where, if
the chase fails, then the answer is true; otherwise, if the instance J′ produced satisfies one
of the conjunctive queries in q^{0≠}, then the answer is true, else the answer is false. We
refer to [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] for more details.
      </p>
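      <p>The chase just described can be made executable under strong simplifying assumptions: each st-tgd has a single source atom as body, and each egd equates two positions of a single target atom. The survivor-choice rules (constants over labeled nulls, Null_D nulls over chase-invented nulls, failure on two distinct constants) follow the text; all encodings are illustrative.</p>

```python
# Simplifications: each st-tgd body is one source atom; each egd equates two
# positions of one target atom. FAIL plays the role of the failing chase.
import itertools

FAIL = "FAIL"

def rank(t):
    """Survivor preference when equating two terms: constants first (0),
    then Null_D nulls like "_u" (1), then chase-invented nulls "_N.." (2)."""
    if not t.startswith("_"):
        return 0
    if not t.startswith("_N"):
        return 1
    return 2

def chase_st_tgds(D, st_tgds):
    """Step (i): st_tgd = (src_rel, tgt_rel, pattern), where pattern entries
    are indexes into the source tuple or 'fresh' for existential variables."""
    counter = itertools.count()
    J = {}
    for (src, tgt, pattern) in st_tgds:
        for tup in D.get(src, set()):
            fact = tuple(tup[p] if p != "fresh" else f"_N{next(counter)}"
                         for p in pattern)
            J.setdefault(tgt, set()).add(fact)
    return J

def chase_egds(J, egds):
    """Step (ii): egd = (tgt_rel, i, j) requires positions i and j equal."""
    while True:
        violation = None
        for (rel, i, j) in egds:
            violation = next((f for f in J.get(rel, set()) if f[i] != f[j]), None)
            if violation is not None:
                a, b = violation[i], violation[j]
                break
        if violation is None:
            return J
        if rank(a) == 0 and rank(b) == 0:
            return FAIL                       # two distinct constants
        # the lower-ranked term survives; ties keep the first arbitrarily
        keep, drop = (a, b) if rank(a) == min(rank(a), rank(b)) else (b, a)
        J = {r: {tuple(keep if t == drop else t for t in f) for f in fs}
             for r, fs in J.items()}

# r(x,y) maps to G(x,y); s(x) maps to G(x, fresh); the egd forces the two
# positions of G to coincide, so every null collapses onto the constant a.
D = {"r": {("a", "_u")}, "s": {("_u",)}}
J = chase_egds(chase_st_tgds(D, [("r", "G", (0, 1)), ("s", "G", (0, "fresh"))]),
               [("G", 0, 1)])                 # {'G': {('a', 'a')}}
```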
      <p>Given an OBDM specification I = ⟨O, M, S⟩, where O is a TBox, and M is a set
of st-tgds, and given a non-ground source instance D for S, and a set Σ_t of egds, we
denote by A_{D,Σ}, where Σ = M ∪ Σ_t, the ABox computed as follows: (i) chase
the non-ground source instance D w.r.t. Σ; (ii) freeze the instance (or equivalently, the
ABox with variables) obtained, i.e., the variables in this instance are now considered as
constants. Note that such an ABox A_{D,Σ} may not exist, due to a failure of the chase;
in this case, we denote A_{D,Σ} by the symbol ⊥.</p>
      <p>
        For an OBDM specification I = ⟨O, M, S⟩, and for a source database C for I
(i.e., a ground instance over the schema S), we denote by sem_C(I) the set of
models B for I relative to C such that (i) B ⊨ O and (ii) (B, C) ⊨ M. Given a query q
over O, we denote by cert(q, I, C) the set of certain answers to q in I relative to
C. It is defined as cert(q, I, C) = ⋂ { q^B | B ∈ sem_C(I) } if sem_C(I) ≠ ∅;
otherwise, cert(q, I, C) = AllTup(q, C), where AllTup(q, C) is the set of all possible
tuples of constants in C whose arity is that of the query q. Furthermore, given a
DL-Lite_{A,id} [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] TBox O and a DL-Lite_{A,id} ABox A, we are able to: (i) check whether
⟨O, A⟩ is satisfiable by computing the answers to a suitable boolean query Q_sat (a UCQ
with at most one inequality per disjunct) over the ABox A considered as a relational
database. We see Q_sat as the union of Q_sat^{0≠} (the UCQ containing every disjunct of Q_sat
comprising no inequalities) and Q_sat^{1≠} (the UCQ containing every disjunct of Q_sat
comprising an inequality); (ii) compute the certain answers to a UCQ Qg over a
satisfiable ⟨O, A⟩, denoted by cert(Qg, ⟨O, A⟩), by producing a perfect
reformulation (denoted by the function pr(·)) of the query, and then computing the answers to
pr(Qg) over the ABox A considered as a relational database. See [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] for more details.
      </p>
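      <p>Point (ii) can be illustrated executably for a tiny fragment with concept inclusions only (no roles and no identification assertions, so no satisfiability query Q_sat is needed): pr rewrites an atomic query into the UCQ of its sub-concepts, and the certain answers are obtained by evaluating that UCQ over the ABox viewed as a database. All names are assumptions of the sketch, not the DL-Lite_{A,id} algorithms.</p>

```python
# Illustrative fragment: only concept inclusions (sub, sup); no roles or
# identification assertions, hence no satisfiability check is needed.

def pr(concept, tbox):
    """Perfect reformulation of {x | concept(x)} w.r.t. the inclusions:
    the UCQ whose disjuncts are all concepts subsumed by `concept`."""
    disjuncts = {concept}
    changed = True
    while changed:
        changed = False
        for (sub, sup) in tbox:
            if sup in disjuncts and sub not in disjuncts:
                disjuncts.add(sub)
                changed = True
    return disjuncts

def cert(concept, tbox, abox):
    """Certain answers: evaluate pr(concept) over the ABox viewed as a
    plain relational database."""
    return set().union(*(abox.get(c, set()) for c in pr(concept, tbox)))

O = {("Manager", "Employee"), ("Employee", "Person")}
A = {"Manager": {"m1"}, "Person": {"p1"}}
answers = cert("Person", O, A)   # {'m1', 'p1'}: pr adds Employee and Manager
```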
    </sec>
    <sec id="sec-3">
      <title>The notion of source-to-target rewriting</title>
      <p>In what follows, we implicitly refer to (i) an OBDM specification I = ⟨O, M, S⟩; (ii)
a query Qs over the source schema S; (iii) a query Qg over the ontology O.</p>
      <p>As we said in the introduction, there are at least two different ways to formally
define a source-to-target rewriting (s-to-t rewriting in the following) for each of the
three variants, namely “exact”, “complete”, and “sound”. The first one is captured by
the following definition.</p>
      <p>Definition 1. Qg is a complete (resp., sound, exact) s-to-t rewriting of Qs with respect
to I under the model-based semantics, if for each source database C and for each model
B ∈ sem_C(I), we have that Qs^C ⊆ Qg^B (resp., Qg^B ⊆ Qs^C, Qg^B = Qs^C).</p>
      <p>Intuitively, a complete s-to-t rewriting of Qs w.r.t. I under the model-based
semantics is a query over O that, when evaluated over a model B ∈ sem_C(I) for a source
database C, returns all the answers of the evaluation of Qs over C. In other words, for
every source database C, the query Qg over O captures all the information that Qs expresses
over C. Similar arguments hold for the notions of sound and exact s-to-t rewriting
under this semantics. Moreover, from the formal definition of source-to-target rewriting
and the usual definition of target-to-source rewriting (simply called rewriting) used in
data integration, it is easy to see that Qg is a complete (resp., sound) source-to-target
rewriting of Qs w.r.t. I under the model-based semantics if and only if Qs is a sound
(resp., complete) rewriting of Qg w.r.t. I, implying that Qg is an exact source-to-target
rewriting of Qs w.r.t. I under the model-based semantics if and only if Qs is an exact
rewriting of Qg w.r.t. I.</p>
      <p>The second possible way to formally define a source-to-target rewriting is as
follows.</p>
      <p>Definition 2. Qg is a complete (resp., sound, exact) s-to-t rewriting of Qs with respect
to I under the certain answers-based semantics, if for each source database C such
that sem_C(I) ≠ ∅, we have that Qs^C ⊆ cert(Qg, I, C) (resp., cert(Qg, I, C) ⊆ Qs^C,
Qs^C = cert(Qg, I, C)).</p>
      <p>
        In this new semantics, in order to capture a query Qs over S, we resort to the notion
of certain answers. Indeed, a complete s-to-t rewriting of Qs w.r.t. I under the certain
answers-based semantics is a query over O such that, when we compute its certain
answers for a source database C, we obtain all the answers of the evaluation of Qs over
C. As before, similar arguments hold for the notions of sound and exact s-to-t
rewriting under this semantics. Note also the strong correspondence between exact s-to-t
rewritings under the certain answers-based semantics and the notion of perfect rewriting.
We recall that a perfect rewriting of Qg w.r.t. I is a query Qs over S that computes
cert(Qg, I, C) for every source database C such that sem_C(I) ≠ ∅ [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. Indeed, we
have that Qg is an exact s-to-t rewriting of Qs w.r.t. I under the certain answers-based
semantics if and only if Qs is a perfect rewriting of Qg w.r.t. I. The above
observations imply that the two semantics are indeed different, since it is well known
that the two notions of exact rewriting and perfect rewriting of Qg w.r.t. I are different.
The difference between the two semantics is confirmed by the following example.
Example 1. O := ∅ (i.e., no TBox assertions in O); S contains a binary relation r1 and
a unary relation r2; M := {∀x∀y(r1(x,y) → G(x,y)), ∀x(r2(x) → ∃Y.G(x,Y))};
Qs := {(w) | ∃Z.r1(Z,w)}; Qg := {(w) | ∃Z.G(Z,w)}.
      </p>
      <p>It is easy to see that Qg is a sound s-to-t rewriting of Qs w.r.t. I under the certain
answers-based semantics (more precisely, it is an exact s-to-t rewriting of Qs w.r.t. I
under such semantics), while it is not sound under the model-based semantics. In fact,
for the source database C with r1^C = {(a,b)} and r2^C = {(c)}, and for the model B with
G^B = {(a,b), (c,d)}, we have B ∈ sem_C(I) and Qg^B ⊈ Qs^C. □
Intuitively, for the sound case, the model-based semantics is too strong, in the sense that
under such semantics a model B may contain not only facts depending on how the data in
the source C are linked to O through M, but also arbitrary additional facts, with the only
constraint of satisfying O. One might think that, in order to address this issue, it
suffices to resort to some form of minimization of the models of O. Actually, the above
example shows that, even if we restrict the set of models to the set of minimal models
(i.e., models B such that (i) B ∈ sem_C(I) and (ii) there is no model B′ ∈ sem_C(I) such
that B′ ⊊ B), and adopt a semantics like the model-based one but restricted to the set
of minimal models, Qg is still not a sound s-to-t rewriting (this can be seen by considering
that the target database B defined earlier is a minimal model).</p>
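      <p>The counterexample of Example 1 can be checked mechanically, with relations encoded as Python sets (an illustrative encoding): B satisfies both mapping assertions for C, yet evaluating Qg over B yields the tuple (d), which Qs over C does not.</p>

```python
# Relations of Example 1 as Python sets (illustrative encoding).
C = {"r1": {("a", "b")}, "r2": {("c",)}}
B = {"G": {("a", "b"), ("c", "d")}}

# (B, C) satisfies M: every r1-fact appears in G, and every r2-element
# has some G-successor.
satisfies_m1 = C["r1"].issubset(B["G"])
satisfies_m2 = all(any(g[0] == x for g in B["G"]) for (x,) in C["r2"])

Qs_over_C = {w for (_, w) in C["r1"]}     # {w | exists Z . r1(Z, w)} = {'b'}
Qg_over_B = {w for (_, w) in B["G"]}      # {w | exists Z . G(Z, w)} = {'b', 'd'}

# Soundness under the model-based semantics would require Qg^B inside Qs^C,
# but 'd' is a counterexample.
sound_here = Qg_over_B.issubset(Qs_over_C)   # False
```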
      <p>Observe that the above considerations show the difference in the two semantics by
referring to sound and exact s-to-t rewritings. It is interesting to ask whether the
difference shows up when restricting our attention to complete rewritings. The following
proposition deals with this question.</p>
      <p>Proposition 1. Qg is a complete s-to-t rewriting of Qs with respect to I under the
model-based semantics if and only if it is so under the certain answers-based semantics.
Proof (Sketch). One direction is trivial. Indeed, when Qg is a complete s-to-t rewriting of
Qs with respect to I under the model-based semantics, by definition of certain answers,
for each source database C such that sem_C(I) ≠ ∅ we have that Qs^C ⊆ cert(Qg, I, C).
For the other direction, suppose that Qg is not a complete s-to-t rewriting of Qs w.r.t. I
under the model-based semantics. It follows that there exist a source database C and a
model B ∈ sem_C(I) such that Qs^C ⊈ Qg^B, implying that Qs^C ⊈ cert(Qg, I, C), which,
in turn, implies that Qg is not a complete s-to-t rewriting of Qs w.r.t. I under the certain
answers-based semantics. □</p>
      <p>Obviously, the query over the ontology that best captures a given query q over
the source schema is the exact s-to-t rewriting of q. However, the following example
shows that, even for very simple OBDM specifications, an exact s-to-t rewriting of even
trivial queries may not exist.</p>
      <p>Example 2. O := ∅ (i.e., no TBox assertions in O); S contains two unary relations man
and woman; M := {∀x(man(x) → Person(x)), ∀x(woman(x) → Person(x))};
Qs := {(x) | woman(x)}.</p>
      <p>It is possible to show that the only sound s-to-t rewriting of Qs w.r.t. I under both
semantics is the query Qg := {(x) | ⊥(x)}, which is obviously not a complete s-to-t
rewriting of Qs w.r.t. I, neither under the model-based semantics nor under the certain
answers-based semantics. On the other hand, the most immediate and intuitive complete
s-to-t rewriting of Qs w.r.t. I is the query Q′g := {(x) | Person(x)}. Furthermore, as
we will see in Section 5, this query is an “optimal” complete s-to-t rewriting of Qs w.r.t.
I, where the term optimal will be precisely defined. □</p>
      <p>As we said in the introduction, in the rest of this paper we focus on complete
s-to-t rewritings. In particular, we will address both the recognition problem (see Section 4)
and the finding problem (see Section 5) in a specific setting, characterized as follows:
– The ontology O in an OBDM specification I = ⟨O, M, S⟩ is expressed as a TBox
in DL-Lite_{A,id}.
– The mapping M in I is a set of GLAV mapping assertions (or st-tgds), where each
assertion expresses a correspondence between a conjunctive query over the source
schema and a conjunctive query over the ontology.
– In the recognition problem, both the query over the source schema and the query
over the ontology are conjunctive queries. Similarly, in the finding problem, the
query over the source schema is a conjunctive query.</p>
    </sec>
    <sec id="sec-4">
      <title>The recognition problem for complete s-to-t rewritings</title>
      <p>We implicitly refer to the setting described at the end of the previous section. The
recognition problem associated with complete s-to-t rewritings is the following decision
problem: given an OBDM specification I = ⟨O, M, S⟩, a query Qs over the source schema
S, and a query Qg over the ontology O, check whether Qg is a complete s-to-t rewriting
of Qs with respect to I. The next lemma is the starting point of our solution.
Lemma 1. Qg is not a complete s-to-t rewriting of Qs with respect to I = ⟨O, M, S⟩
if and only if there exist a valuation v of D_Qs and a model B ∈ sem_{v(D_Qs)}(I) such that
v(tup(Qs)) ∉ Qg^B.</p>
      <p>Proof. “⟸” Suppose that there exist a valuation v of D_Qs and a model B ∈ sem_{v(D_Qs)}(I)
such that v(tup(Qs)) ∉ Qg^B. Obviously, v(tup(Qs)) ∈ Qs^{v(D_Qs)}. It follows that there
exist a source database v(D_Qs), a model B ∈ sem_{v(D_Qs)}(I), and a tuple v(tup(Qs))
such that v(tup(Qs)) ∉ Qg^B and v(tup(Qs)) ∈ Qs^{v(D_Qs)}.</p>
      <p>“⟹” Suppose that Qg is not a complete s-to-t rewriting of Qs w.r.t. I, i.e., there
exist a source database C, a model B ∈ sem_C(I), and a tuple t such that t ∈ Qs^C and
t ∉ Qg^B. The fact that t ∈ Qs^C implies the existence of a homomorphism v : D_Qs → C
such that v(tup(Qs)) = t. Note also that, since C is a ground instance, v is a valuation
of D_Qs such that v(D_Qs) ⊆ C. Obviously, B ∈ sem_{v(D_Qs)}(I); this can be seen by
considering that (i) B ⊨ O holds by the assumption that B ∈ sem_C(I); and (ii)
(B, v(D_Qs)) ⊨ M holds by considering that (B, C) ⊨ M (which follows from the
assumption that B ∈ sem_C(I)), that v(D_Qs) ⊆ C, and that the queries in M are monotone.
It follows that there exist a valuation v of D_Qs and a model B ∈ sem_{v(D_Qs)}(I)
such that v(tup(Qs)) ∉ Qg^B. □</p>
      <sec id="sec-4-1">
        <title>Algorithm 1: CheckComplete(I, Qs, Qg). Input: OBDM specification I = ⟨O, M, S⟩, query Qs over S, query Qg over O.</title>
        <p>Output: true or false.
1: Compute D_Qs from Qs (i.e., the instance, possibly with incomplete information,
associated with the query Qs), and denote it by D.
2: Compute A_{D,Σ}, where Σ = M ∪ ¬Q_sat^{1≠}, and let σ be the set of equalities applied
to the variables in D by the chase.
3: If A_{D,Σ} = ⊥, then return true.
4: If the evaluation of Q_sat^{0≠} over A_{D,Σ} considered as a relational database is {()}
(i.e., A_{D,Σ} ⊨ Q_sat^{0≠}), then return true.
5: If σ(tup(Qs)) ∈ cert(Qg, ⟨O, A_{D,Σ}⟩) then return true, else return false.</p>
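        <p>A drastically simplified version of the algorithm can be run for the degenerate case of an empty TBox, no egds, single-atom mapping bodies, and an atomic Qg (so steps 3 and 4 never fire and σ is the identity). The helper repeats the toy st-tgd chase used earlier; all encodings are illustrative assumptions, not the general algorithm.</p>

```python
# Illustrative-only sketch: empty TBox, no egds, mapping assertions with a
# single source atom per body, and an atomic query Qg over the ontology.
import itertools

def chase_st_tgds(D, st_tgds):
    """Toy chase step: apply each st-tgd (src_rel, tgt_rel, pattern) to
    every matching source fact; 'fresh' entries produce new labeled nulls."""
    counter = itertools.count()
    J = {}
    for (src, tgt, pattern) in st_tgds:
        for tup in D.get(src, set()):
            fact = tuple(tup[p] if p != "fresh" else f"_N{next(counter)}"
                         for p in pattern)
            J.setdefault(tgt, set()).add(fact)
    return J

def check_complete(D_Qs, tup_Qs, M, Qg_rel, Qg_pattern):
    """Steps 1, 2 and 5 in this degenerate setting: chase D_Qs w.r.t. M,
    freeze (nulls simply stay as plain strings), and test whether tup(Qs)
    is among the answers of Qg over the resulting ABox."""
    abox = chase_st_tgds(D_Qs, M)
    answers = {tuple(f[p] for p in Qg_pattern) for f in abox.get(Qg_rel, set())}
    return tup_Qs in answers

# Example 2 of the paper: man and woman both map to Person;
# Qs = {x | woman(x)}, candidate rewriting Qg = {x | Person(x)}.
D_Qs = {"woman": {("_x",)}}    # the instance associated with Qs
M = [("man", "Person", (0,)), ("woman", "Person", (0,))]
complete = check_complete(D_Qs, ("_x",), M, "Person", (0,))   # True
```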
        <p>The next theorem establishes the correctness of the above algorithm.</p>
        <p>Theorem 1. CheckComplete(I, Qs, Qg) terminates, and returns true if and only if
Qg is a complete s-to-t rewriting of Qs w.r.t. I.</p>
        <p>Proof (Sketch). Termination of the algorithm follows easily from the termination of the
chase procedure and from the obvious termination of computing the certain answers of a
CQ over ⟨O, A_{D,Σ}⟩.</p>
        <p>For the “⟹” direction, suppose that the algorithm returns false, i.e., A_{D,Σ} ⊭ Q_sat^{0≠}
and σ(tup(Qs)) ∉ cert(Qg, ⟨O, A_{D,Σ}⟩). Now, if we extend σ(D) by considering the
freezing of this instance (i.e., variables are now considered as constants), it is easy to
see that we obtain a valuation v of D such that (v(D), A_{D,Σ}) ⊨ M, and such that
sem_{v(D)}(I) ≠ ∅. Moreover, the fact that σ(tup(Qs)) ∉ cert(Qg, ⟨O, A_{D,Σ}⟩) implies,
by the properties of certain answers, that there is at least one model B ⊨ ⟨O, A_{D,Σ}⟩, and
hence B ∈ sem_{v(D)}(I) (because (v(D), A_{D,Σ}) ⊨ M), such that v(tup(Qs)) ∉ Qg^B. It
follows, from Lemma 1, that Qg is not a complete s-to-t rewriting of Qs w.r.t. I.</p>
        <p>For the “⟸” direction, in the cases where A_{D,Σ} = ⊥ or A_{D,Σ} ⊨ Q_sat^{0≠}, it is easy
to see that, for every valuation v of D, either the chase of v(D) fails, or every
ABox A such that (v(D), A) ⊨ M and A ⊨ ¬Q_sat^{1≠} is such that A ⊨ Q_sat^{0≠},
implying that, for every valuation v of D, sem_{v(D)}(I) = ∅. It follows, from Lemma 1,
that in this case Qg is a complete s-to-t rewriting of Qs w.r.t. I. In the case
where σ(tup(Qs)) ∈ cert(Qg, ⟨O, A_{D,Σ}⟩), it is easy to see that, for every valuation
v of D, either sem_{v(D)}(I) = ∅, or, if we compute A_{v(D),Σ}, we have that v(tup(Qs)) ∈
cert(Qg, ⟨O, A_{v(D),Σ}⟩). More generally, every A obtained by chasing v(D) w.r.t. M
and ¬Q_sat^{1≠}, and then choosing arbitrary constants for the possibly remaining variables,
is such that v(tup(Qs)) ∈ cert(Qg, ⟨O, A⟩). Hence, for every model B such that
B ⊨ ⟨O, A⟩, we have that v(tup(Qs)) ∈ Qg^B. Also, we observe that the set of models
sem_{v(D)}(I) coincides with the set of all models B such that B ⊨ ⟨O, A⟩ for all the possible
ABoxes A obtained using the above procedure. It follows that, for every possible
valuation v of D and for every possible B ∈ sem_{v(D)}(I), we have that v(tup(Qs)) ∈ Qg^B,
implying, from Lemma 1, that also in this case Qg is a complete s-to-t rewriting of Qs
w.r.t. I. □
As for the complexity of the algorithm, we observe: (i) it runs in PTIME in the size
of Qs. Indeed, computing D (the instance associated with the query Qs) can be done in
linear time, and chasing an instance in the presence of a weakly acyclic set of tgds (as
in our case) is PTIME in the size of D (M and Σ are considered fixed); (ii) it runs in
PTIME in the size of O. Indeed, Q_sat and the certain answers of Qg
can both be computed in PTIME in the size of O; (iii) it runs in EXPTIME in the size of
M. This can be seen from the obviously EXPTIME process of transferring data from D to
A_{D,Σ}; (iv) the problem is NP-complete in the size of Qg, because computing the certain
answers of a UCQ is NP-complete in the size of the query (query complexity).</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Finding optimal complete s-to-t rewritings</title>
      <p>In this section we study the problem of finding optimal complete s-to-t rewritings. The
first question to ask is which rewriting to choose in the case where several complete
rewritings exist. The obvious choice is to define the notion of “optimal” complete s-to-t
rewriting: such a rewriting r is optimal if there is no complete s-to-t rewriting that is
properly contained in r. In order to formalize this notion, we introduce the following definitions
(where MOD(O) denotes the set of models of O).</p>
      <sec id="sec-5-1">
        <title>Definitions</title>
        <p>Definition 3. Qg is contained in Q′g with respect to O, denoted Qg ⊑O Q′g, if for
every model B ∈ MOD(O) we have that QgB ⊆ Q′gB. Qg is properly contained in
Q′g with respect to O, denoted Qg ⊏O Q′g, if Qg ⊑O Q′g and for at least one model
B ∈ MOD(O) we have that QgB ⊊ Q′gB.</p>
        <p>Definition 4. Qg is an optimal complete s-to-t rewriting of Qs with respect to I, if Qg
is a complete s-to-t rewriting of Qs with respect to I, and there exists no query Q′g such
that Q′g is a complete s-to-t rewriting of Qs with respect to I and Q′g ⊏O Qg.</p>
        <p>We are now ready to present an algorithm for computing an optimal complete s-to-t rewriting
of a query over the source schema.</p>
      </sec>
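      <p>When O is empty, containment between CQs reduces to the classical Chandra and Merlin homomorphism test: Qg ⊑ Q′g iff there is a homomorphism from Q′g into the frozen body of Qg that respects the answer variables. A minimal sketch (the query encoding and all names are ours):</p>

```python
from itertools import product

def homomorphism_exists(src_atoms, dst_facts, fixed=None):
    """Is there a mapping of the uppercase variables of src_atoms into the
    terms of dst_facts, extending `fixed`, that sends every atom of
    src_atoms onto some fact of dst_facts?  (Naive exhaustive search.)"""
    fixed = fixed or {}
    facts = set(dst_facts)
    vars_ = sorted({t for _, ts in src_atoms for t in ts
                    if t.isupper() and t not in fixed})
    terms = sorted({t for _, ts in dst_facts for t in ts})
    for choice in product(terms, repeat=len(vars_)):
        h = dict(fixed, **dict(zip(vars_, choice)))
        if all((p, tuple(h.get(t, t) for t in ts)) in facts
               for p, ts in src_atoms):
            return True
    return False

# Qg(X) :- r(X, Y), r(Y, Z)   versus   Q'g(X) :- r(X, Y)
frozen_qg  = [("r", ("x0", "y0")), ("r", ("y0", "z0"))]  # frozen body of Qg
frozen_qpg = [("r", ("x0", "y0"))]                       # frozen body of Q'g
qg  = [("r", ("X", "Y")), ("r", ("Y", "Z"))]
qpg = [("r", ("X", "Y"))]

contained     = homomorphism_exists(qpg, frozen_qg,  fixed={"X": "x0"})  # Qg contained in Q'g
contained_rev = homomorphism_exists(qg,  frozen_qpg, fixed={"X": "x0"})  # Q'g contained in Qg
```

The naive search is exponential in the number of variables, in line with the NP-completeness of CQ containment mentioned in the complexity discussion above.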
      <sec id="sec-5-2">
        <title>Algorithm 2: FindOptimalComplete(I, Qs)</title>
        <p>Input: OBDM specification I = ⟨O, S, M⟩, CQ Qs over S.</p>
        <p>Output: query Qg over O.</p>
        <p>1: Compute DQs from Qs (i.e., the instance, possibly with incomplete information, associated
with the query Qs), and denote it with D.</p>
        <p>2: Chase D w.r.t. M to produce an instance J′.</p>
        <p>3: Chase J′ w.r.t. ¬Q≠sat; if the chase fails, then stop and return the query
{tup(Qs) | ⊥(tup(Qs))}; otherwise, let J be the instance produced, and let σ be the set of
equalities applied to the variables in D by the chase.</p>
        <p>4: Evaluate Q≠sat over J; if the answer is {()} (i.e., J ⊨ Q≠sat), then stop and return the query
{tup(Qs) | ⊥(tup(Qs))}.</p>
        <p>5: If J = ∅ (i.e., there are no atoms in the instance J), then stop and return the query
{tup(Qs) | ⊤(tup(Qs))}; otherwise, let QJ be the boolean conjunctive query associated with
the instance J.</p>
        <p>6: Let w be the tuple composed of all terms in σ(tup(Qs)) not appearing in J. If such a tuple is
not empty, then return {σ(tup(Qs)) | QJ(σ(tup(Qs))) ∧ ⊤(w)}; otherwise, return
{σ(tup(Qs)) | QJ(σ(tup(Qs)))}.</p>
        <p>For the termination and the complexity of this algorithm, the same considerations made for the
CheckComplete algorithm apply. In particular, FindOptimalComplete(I, Qs) terminates,
and it runs in (i) PTIME in the size of Qs; (ii) PTIME in the size of O; (iii) EXPTIME
in the size of M. Its correctness is established by the next theorem.</p>
        <p>Theorem 2. FindOptimalComplete(I, Qs) returns an optimal complete s-to-t
rewriting of Qs w.r.t. I.</p>
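        <p>Step 2 is a standard chase with the tgds of the mapping; with a weakly acyclic set of tgds it runs in PTIME in the size of D, as noted above. A simplified sketch restricted to full (existential-free) tgds with single-atom bodies (encoding and names ours; the general chase also invents labelled nulls for existential head variables):</p>

```python
def chase_full(facts, tgds):
    """Naive chase with full (existential-free) tgds having a single body
    atom.  Each tgd is ((body_pred, body_terms), [(head_pred, head_terms), ...]);
    uppercase strings are universally quantified variables.  Always
    terminates, since no fresh values are invented."""
    facts = set(facts)
    changed = True
    while changed:
        changed = False
        for (bpred, bterms), head in tgds:
            for pred, terms in list(facts):
                if pred != bpred or len(terms) != len(bterms):
                    continue
                h, ok = {}, True
                for bt, t in zip(bterms, terms):
                    if bt.isupper():                 # variable: bind or check
                        ok = ok and h.setdefault(bt, t) == t
                    else:                            # constant: must match
                        ok = ok and bt == t
                if not ok:
                    continue
                for hpred, hterms in head:           # fire the trigger
                    fact = (hpred, tuple(h.get(t, t) for t in hterms))
                    if fact not in facts:
                        facts.add(fact)
                        changed = True
    return facts

# D = {emp(ann, it)}; mapping tgd: emp(X, Y) -> Employee(X), worksIn(X, Y)
j = chase_full([("emp", ("ann", "it"))],
               [(("emp", ("X", "Y")),
                 [("Employee", ("X",)), ("worksIn", ("X", "Y"))])])
```

Restricting to full tgds sidesteps null invention while keeping the fixpoint structure of the chase.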
        <p>Proof (Sketch). When the algorithm returns the query {tup(Qs) | ⊥(tup(Qs))}, it is
easy to see that, regardless of the query Qg, running
CheckComplete(I, Qs, Qg) returns true (also in this case, either the chase fails, or the
ABox AD,σ produced satisfies Q≠sat), and hence, by Theorem 1, Qg is a complete
s-to-t rewriting of Qs w.r.t. I. It follows that {tup(Qs) | ⊥(tup(Qs))} is also a
complete s-to-t rewriting and, by definition of such a query, it is an optimal complete s-to-t
rewriting of Qs w.r.t. I.</p>
        <p>When the algorithm returns the query Qg = {σ(tup(Qs)) | QJ(σ(tup(Qs))) ∧
⊤(w)} (or {tup(Qs) | ⊤(tup(Qs))}, in the case J = ∅), running
CheckComplete(I, Qs, Qg) computes the ABox AD,σ, and tup(Qs) ∈ cert(Qg, ⟨O, AD,σ⟩)
holds because Qg corresponds exactly to AD,σ (before being frozen) extended with
⊤(w) for all terms w in σ(tup(Qs)) not appearing in AD,σ. It follows that, also in this
case, CheckComplete(I, Qs, Qg) returns true, implying, by Theorem 1, that Qg is a
complete s-to-t rewriting of Qs w.r.t. I.</p>
        <p>We now prove that the query Qg = {σ(tup(Qs)) | QJ(σ(tup(Qs))) ∧ ⊤(w)} (or
{tup(Qs) | ⊤(tup(Qs))}, in the case J = ∅) is also an optimal complete s-to-t rewriting
of Qs w.r.t. I. In particular, suppose that there exists a query Q′g such that Q′g ⊏O Qg,
i.e., Q′g ⊑O Qg and there are a model B ∈ MOD(O) and a tuple t such that t ∈ QgB and
t ∉ Q′gB. The fact that t ∈ QgB implies the existence of a valuation v of all the variables
in Qg that makes Qg true in B. Note that we can extend the valuation v by assigning
a fresh constant to every variable appearing in D and not appearing in Qg. The
valuation v thus obtained is a valuation for D, and obviously t ∈ Qsv(D). Moreover, if we
apply the same valuation v to the instance J, it is easy to see that we obtain a ground
instance J′ such that (v(D), J′) ⊨ M (recall that Qg is the CQ associated with the
instance J). Obviously, J′ ⊆ B, and hence (v(D), B) ⊨ M holds because the queries in
the mapping M are monotone. Moreover, we also have that B ∈ semv(D)(I)
(the fact that B ⊨ O holds by the initial supposition). Hence, for the source database
v(D) there are a model B ∈ semv(D)(I) and a tuple t such that t ∈ Qsv(D) and t ∉ Q′gB,
implying that Q′g is not a complete s-to-t rewriting of Qs w.r.t. I. □</p>
        <p>It is easy to prove that the query returned by the algorithm is not only an optimal
complete s-to-t rewriting of Qs w.r.t. I, but also the unique (up to equivalence) optimal
complete s-to-t rewriting of Qs w.r.t. I. Furthermore, the above result implies that an
optimal complete s-to-t rewriting of Qs w.r.t. I can always be expressed as a CQ.</p>
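        <p>The whole bottom-up construction can be traced on a toy example (the encoding and all names are ours, not from the paper): freeze Qs, apply the mapping, and read off the CQ associated with the resulting instance.</p>

```python
# Toy end-to-end run of the bottom-up construction (representation ours):
# Qs(X) :- emp(X, Y); single mapping assertion emp(X, Y) -> Employee(X).

frozen = [("emp", ("x0", "y0"))]   # canonical instance of Qs, X frozen to x0
mapping = {("emp", 2): lambda a, b: [("Employee", (a,))]}

j = []                             # the chased instance J over the ontology alphabet
for pred, args in frozen:
    for fact in mapping.get((pred, len(args)), lambda *_: [])(*args):
        if fact not in j:
            j.append(fact)

# With tup(Qs) = (x0,), the CQ associated with J reads Qg(X) :- Employee(X):
# in this trivial case, the optimal complete rewriting returned by the algorithm.
```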
      </sec>
    </sec>
    <sec id="sec-6">
      <title>Conclusion</title>
      <p>We have introduced the notion of Ontology-based Open Data Publishing, whose idea is
to use an OBDM specification as a basis for carrying out the task of publishing
high-quality open data.</p>
      <p>In this paper we have focused on the bottom-up approach to ontology-based open
data publishing, introduced the notion of source-to-target rewriting, and
developed algorithms for two problems related to complete source-to-target
rewritings, namely the recognition problem and the finding problem. We plan to continue our work in
several directions. In particular, we plan to investigate the notion of sound rewriting
under different semantics. We also want to study the top-down approach, especially with
the goal of devising techniques for deriving the intensional knowledge to associate
with datasets in order to document their content in a suitable way.</p>
    </sec>
  </body>
</article>