<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Semantic Loss in Query Reformulation in Dynamic Distributed Environments</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Bruno F. F. Souza</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Maria C. M. Batista</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ana Carolina Salgado</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Federal Rural University of Pernambuco, Informatics Department</institution>
          ,
          <addr-line>Pernambuco</addr-line>
          ,
          <country country="BR">Brazil</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Federal University of Pernambuco, Center for Informatics</institution>
          ,
          <addr-line>Pernambuco</addr-line>
          ,
          <country country="BR">Brazil</country>
        </aff>
      </contrib-group>
      <fpage>220</fpage>
      <lpage>224</lpage>
      <abstract>
        <p>Dynamic environments are decentralized systems that provide users with querying capabilities over a set of heterogeneous, distributed and autonomous data sources. Data Integration Systems, Peer Data Management Systems (PDMS) and Dataspaces are examples of such systems. They are composed of data sources (peers) that belong to a specific domain and are linked to each other by mappings (correspondences). A challenge inherent to dynamic environments is analyzing the semantic loss during query reformulation. Semantic loss may occur when a query is reformulated from one peer to another in the system. To minimize the consequences of this problem, we propose the use of information quality criteria to support semantic loss analysis, a step executed to evaluate the query routing possibilities.</p>
      </abstract>
      <kwd-group>
        <kwd>semantic loss</kwd>
        <kwd>dynamic environments</kwd>
        <kwd>PDMS</kwd>
        <kwd>information quality</kwd>
        <kwd>quality criteria</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        Nowadays, there is a demand for high-level integration of autonomous and
heterogeneous data sources through the development of distinct types of distributed
environments, including Data Integration Systems [
        <xref ref-type="bibr" rid="ref1 ref16">1</xref>
        ], Peer Data Management Systems
(PDMS) [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] and Dataspaces [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. These dynamic environments (DEs) are composed
of various autonomous data sources (e.g., sites, files, databases), referred to here as peers,
which maintain information about a certain domain and which are linked to other
peers by mappings (i.e., associations between schemas), hereafter called
correspondences.
      </p>
      <p>In a PDMS, when a user poses a query at a peer, the query is executed in that peer
and then reformulated to its neighboring peers in order to acquire more information.
This reformulation process may lead to query degradation, i.e., the query undergoes
transformations such that concepts used in the query are not present
in the target peer. In other words, these concepts are left out during the
reformulation process among peers. Semantic loss is one aspect of query degradation.</p>
      <p>
        DEs still suffer from inadequate control mechanisms for addressing, for instance, the
quality of the query answers as well as the quality of the generated correspondences
between peer schemas. Including Information Quality (IQ) analysis in a DE can improve
system processes such as query evaluation and peer clustering. IQ is usually
characterized via multiple criteria, each of which captures a high-level aspect of quality. The
role of each one is to assess and measure a specific IQ dimension [
        <xref ref-type="bibr" rid="ref4 ref5">4,5</xref>
        ]. Thus, quality
metrics are used to measure a particular quality criterion.
      </p>
      <p>The goal of this work is to present the idea of using IQ to support semantic loss
analysis after query reformulation in PDMSs. For this purpose, we propose two IQ
criteria to analyze such loss and to provide this information to improve the query routing
process. This paper is organized as follows: Section 2 provides an example of query
routing; Section 3 introduces the semantic loss problem; Section 4 considers IQ
criteria for semantic loss analysis; Section 5 discusses related work; and finally, Section
6 presents some concluding considerations.</p>
    </sec>
    <sec id="sec-2">
      <title>Query Routing in Dynamic Environments</title>
      <p>
        A DE has a set of autonomous peers that offer information and services to be
shared. An issue arises in this scenario: when a user poses a query,
how can the system choose the best possible peer or group of peers to send that query
to in order to retrieve relevant information? This process is called query routing and
has been addressed in [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. As an example of query routing and query reformulation,
consider the hypothetical research center PDMS depicted in Figure 1. The arrows
indicate schema correspondences between connected peers, which are used to
reformulate the query (i.e., to transform a query based on one schema into a query over another schema) for
each peer's immediate neighbors, and so on. In this illustration, consider a user in Brazil
who poses query QB based on his/her local schema. QB will first be reformulated to
peer Portugal, according to the set of correspondences COB-P. Then, peer Portugal
must decide to which peer it should reformulate the query in order to retrieve the best
possible results. There are two possible paths to follow: France and Germany. The query
routing mechanism is responsible for dealing with this issue. In this case, if the
query reformulation process generates semantic loss, it may
compromise the query result (for example, by producing imprecise answers).
      </p>
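      <p>The reformulation step described above can be illustrated with a minimal sketch. It assumes, purely for illustration, that a query is a list of schema concepts and that a set of correspondences is a dictionary from source concepts to target concepts. The concept names and the <monospace>reformulate</monospace> helper are hypothetical and do not come from any PDMS implementation.</p>
      <preformat>
```python
from typing import Dict, List, Tuple

# Hypothetical correspondences from the Brazil schema to the Portugal schema.
# The concept names are invented for illustration only.
CO_B_P: Dict[str, str] = {
    "Researcher": "Investigador",
    "Project": "Projeto",
}


def reformulate(query: List[str], correspondences: Dict[str, str]) -> Tuple[List[str], List[str]]:
    """Rewrite each query concept via the correspondences.

    Returns the reformulated concepts and the concepts that had no
    correspondence and were therefore left out of the reformulated query.
    """
    rewritten = [correspondences[c] for c in query if c in correspondences]
    dropped = [c for c in query if c not in correspondences]
    return rewritten, dropped


q_b = ["Researcher", "Project", "Funding"]
q_p, left_out = reformulate(q_b, CO_B_P)
print(q_p)       # ['Investigador', 'Projeto']
print(left_out)  # ['Funding']
```
      </preformat>
      <p>In this sketch, the concept without a correspondence is simply dropped, which is one simple way a reformulated query can lose part of its original meaning.</p>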
    </sec>
    <sec id="sec-3">
      <title>Semantic Loss</title>
      <p>
        As shown in the previous section, a peer reformulates a query to its neighbors,
which in most cases have different schemas. This process is repeated recursively
across many peers in the network. As a result, the query may lose some of its
meaning due to the reformulations over different peers’ schemas. This problem is called
semantic loss and is stated as follows [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]: suppose that a query Q is initially
submitted over the schema of peer P1. The query is then reformulated to Q’ over the schema of peer P2.
Let Q’’ be the reformulation of Q’ from P2 back to P1. The difference Q – Q’’ is
called the semantic loss of the original query Q when submitted to P2.
      </p>
      <p>The semantic loss is detected by comparing two queries: the original one, Q, and
the reversed one, Q’’. There are two ways to compare them. The first is to compare the queries
syntactically, using the syntactic differences to estimate the semantic loss. The
second is to compare the results of Q and Q’’. It is important to point out
that semantic loss does not always occur: if a peer Pj has the same schema as
Pk, there is no semantic loss in reformulating a query from Pj to Pk and vice versa.</p>
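      <p>The definition of semantic loss as the difference Q – Q’’ can be sketched as follows. This is a simplified illustration of the syntactic comparison only. It assumes that a query is reduced to the set of schema concepts it uses and that correspondences are one-to-one dictionaries; the function names and the sample concepts are hypothetical.</p>
      <preformat>
```python
from typing import Dict, Set


def reformulate(concepts: Set[str], correspondences: Dict[str, str]) -> Set[str]:
    """Keep only the concepts that have a correspondence, renamed to the target schema."""
    return {correspondences[c] for c in concepts if c in correspondences}


def semantic_loss(q: Set[str], p1_to_p2: Dict[str, str], p2_to_p1: Dict[str, str]) -> Set[str]:
    """Return Q - Q'', where Q' is Q reformulated over P2 and Q'' is Q' reformulated back over P1."""
    q_prime = reformulate(q, p1_to_p2)
    q_double_prime = reformulate(q_prime, p2_to_p1)
    return q - q_double_prime


# Hypothetical correspondences between the schemas of P1 and P2.
p1_to_p2 = {"Researcher": "Investigador", "Project": "Projeto"}
p2_to_p1 = {target: source for source, target in p1_to_p2.items()}

q = {"Researcher", "Project", "Funding"}
loss = semantic_loss(q, p1_to_p2, p2_to_p1)
print(sorted(loss))  # ['Funding']

# Identical schemas (identity correspondences) produce no loss.
identity = {c: c for c in q}
print(semantic_loss(q, identity, identity))  # set()
```
      </preformat>
      <p>Comparing the results of Q and Q’’, the second option mentioned above, would require executing both queries and is not covered by this sketch.</p>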
      <p>In this work, we address the semantic loss problem through the following question: how can
IQ criteria help detect and minimize semantic loss in a PDMS network?
We use IQ criteria analysis to track such loss and also to guide the query routing
process. The idea is to calculate the semantic loss after reformulating a query and
to send the reformulated query only to peers for which the evaluated semantic loss is
acceptable. The next section explains the IQ criteria used to evaluate the semantic
loss.</p>
    </sec>
    <sec id="sec-4">
      <title>Quality Criteria for Semantic Loss Analysis</title>
      <p>We state our approach as follows: suppose that a peer P1 can
reformulate a query Q to its neighbors P2 or P3. If P3 has a higher degree of information
completeness than P2, the semantic loss in reformulating Q to P3 is expected to be smaller than in
reformulating it to P2. The IQ criteria that compose the information completeness of a peer are
data completeness and schema completeness. In the following, we define
these two criteria and describe how they can be evaluated.</p>
      <p>
        Data Completeness: due to the dynamicity of the environment, query answers in a PDMS may not be
complete with respect to the original definition of data completeness, which is typically understood as
the ratio of the answer set size to the total amount of known data [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]. That definition requires
knowledge of the total amount of data in the system and relies on the closed-world
assumption. Peer schemas in the set of available peers instead follow an open-world
assumption [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ], i.e., the data returned by querying these peers may be incomplete. In
this light, data completeness in PDMSs may be defined as the ratio between the received
results and the existing suitable data belonging to the available peers at query
answering time. Thus, we define the data completeness of a peer Pj for a query Q originally
submitted from a peer Pi (Pj is a neighbor of Pi) by Formula 1:
        <disp-formula>
          <label>(1)</label>
          <tex-math>\mathit{DataCompleteness}(P_j, Q) = \frac{Q_{P_i P_j}\mathit{tuples}}{\sum_{k=1}^{n} Q_{P_i P_k}\mathit{tuples}}</tex-math>
        </disp-formula>
where QPiPj tuples is the number of tuples returned by peer Pj for query Q
reformulated from Pi to Pj;
QPiPk tuples is the number of tuples returned by peer Pk for query Q
reformulated from Pi to Pk;
n is the number of neighbors of Pi; and
the peers Pk (1 ≤ k ≤ n) are the neighbors of Pi.
Schema Completeness: the degree to which entities and properties of the peer are
not missing with respect to the entities and properties requested in a submitted
query Q. In routing a query Q from peer Pi to peer Pj, the completeness of Pj
with respect to query Q may be assessed by taking the ratio between the number of schema
elements queried in Q and the number of elements held by Pj, as in Formula 2:
        <disp-formula>
          <label>(2)</label>
          <tex-math>\mathit{SchemaCompleteness}(P_j, Q) = \frac{Q\mathit{elements}}{P_j\mathit{elements}}</tex-math>
        </disp-formula>
where Qelements is the number of schema elements present in Q and
Pjelements is the number of Pj schema elements.
      </p>
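      <p>As a rough illustration of how the two criteria could be computed, the sketch below follows the ratios described in the text: tuples returned by one neighbor over the tuples returned by all neighbors, and schema elements in the query over schema elements held by the peer. The function names, the sample counts, and the handling of empty denominators are assumptions made for this example only.</p>
      <preformat>
```python
from typing import Dict


def data_completeness(tuples_by_neighbor: Dict[str, int], peer: str) -> float:
    """Tuples returned by `peer` divided by the tuples returned by all neighbors of Pi."""
    total = sum(tuples_by_neighbor.values())
    if total == 0:
        return 0.0  # assumption: no neighbor returned data, so the peer scores 0
    return tuples_by_neighbor[peer] / total


def schema_completeness(query_elements: int, peer_elements: int) -> float:
    """Ratio of schema elements in Q to schema elements held by the peer, as stated in the text."""
    if peer_elements == 0:
        return 0.0  # assumption: a peer with an empty schema scores 0
    return query_elements / peer_elements


# Hypothetical tuple counts returned by the neighbors of Pi for query Q.
tuples = {"P2": 30, "P3": 90}
print(data_completeness(tuples, "P2"))  # 0.25
print(data_completeness(tuples, "P3"))  # 0.75
print(schema_completeness(4, 10))       # 0.4
```
      </preformat>
      <p>With these sample counts, P3 returns three quarters of the tuples obtained from the neighbors of Pi and would therefore be ranked above P2 on data completeness.</p>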
      <p>Our proposal is to use these IQ criteria to analyze semantic loss in query
reformulation and to identify routing possibilities. In practice, we intend to compute these
criteria dynamically in order to decide whether it is worthwhile to send the query to peers,
based on their IQ scores (values). For example, a low information
completeness score for a peer probably means that the query will suffer semantic loss; otherwise, the query
may be sent with little or no semantic loss. Moreover, the semantic loss of a
query may be used as a criterion to stop the query routing process.</p>
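      <p>The routing decision outlined above might look like the following sketch. It assumes that an information completeness score has already been computed for each neighbor and that a fixed acceptance threshold is used. Both the threshold and the <monospace>select_targets</monospace> helper are hypothetical, since the text does not specify how the scores are combined or compared.</p>
      <preformat>
```python
from typing import Dict, List


def select_targets(scores: Dict[str, float], threshold: float) -> List[str]:
    """Return the neighbors whose score meets the threshold, best score first."""
    acceptable = [peer for peer, score in scores.items() if score >= threshold]
    return sorted(acceptable, key=lambda peer: scores[peer], reverse=True)


# Hypothetical information completeness scores for the neighbors of peer Portugal.
scores = {"France": 0.8, "Germany": 0.3}
targets = select_targets(scores, threshold=0.5)
print(targets)  # ['France']

# An empty result is the signal to stop routing along this path.
print(select_targets(scores, threshold=0.9))  # []
```
      </preformat>
      <p>Returning an empty list when no neighbor is acceptable gives the caller a simple stopping condition for query routing.</p>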
    </sec>
    <sec id="sec-5">
      <title>Related Work</title>
      <p>
        Semantic query reformulation has attracted significant attention. Bonifati et al. [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] provide a ‘relevance’ concept of a query with respect to a mapping, based on the AF-IMF
metric. This metric takes into account the semantic proximity between the query and
the local and external mappings, thus creating only relevant mappings and minimizing
the semantic loss. Delveroudis et al. [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] present an algorithm that estimates the semantic loss
of rewritten queries based on the notion of query containment. This information is
used as the basis for extending the schema mappings and improving the quality of
retrieved answers. A formal definition of the semantics of query answering is presented in
[
        <xref ref-type="bibr" rid="ref14">14</xref>
        ]. In that work, the authors present an algorithm that preserves semantics and reduces
semantic loss during query reformulation. The authors of [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ] highlight that the lack
of IQ analysis may contribute to information loss and affect the completeness of query
answers.
      </p>
    </sec>
    <sec id="sec-6">
      <title>Conclusion and Future Work</title>
      <p>In this paper, we addressed the problem of semantic loss that affects the
propagation of queries in DEs, more specifically in PDMSs. We illustrated the query routing
process and how a query is reformulated in a PDMS environment. We believe
that the analysis of IQ criteria plays an important role in detecting query
degradation, by taking IQ scores into account in the query reformulation process. To this end,
we presented two IQ criteria that we consider relevant to that analysis and described how
they may be assessed. This evaluation may improve the overall query reformulation
process, both in retrieving relevant information from peers and in routing queries
only to peers that provide meaningful information. Currently, we are specifying the
IQ criteria and preparing our environment to implement and test our
approach in a PDMS called SPEED<sup>1</sup>. We also plan to study the use of other criteria to
enrich the semantic loss analysis.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Halevy</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rajaraman</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ordille</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>Data Integration: The Teenage Years</article-title>
          .
          <source>In: 32nd VLDB</source>
          , Volume
          <volume>32</volume>
          , p.
          <fpage>9</fpage>
          -
          <lpage>16</lpage>
          , (
          <year>2006</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Herschel</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Heese</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          :
          <article-title>Humboldt Discoverer: A Semantic P2P Index for PDMS</article-title>
          .
          <source>In: Proc. of the International Workshop Data Integration and the Semantic Web</source>
          , Portugal (
          <year>2005</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Franklin</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Halevy</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Maier</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          :
          <article-title>From Databases to Dataspaces: A New Abstraction for Information Management</article-title>
          .
          <source>In: SIGMOD</source>
          , Volume
          <volume>34</volume>
          , p.
          <fpage>27</fpage>
          -
          <lpage>33</lpage>
          , (
          <year>2005</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Dustdar</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pichler</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Savenkov</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Truong</surname>
          </string-name>
          , H.:
          <article-title>Quality-aware Service-Oriented Data Integration: Requirements, State of the Art and Open Challenges</article-title>
          .
          <source>In: ACM SIGMOD Record</source>
          , Volume
          <volume>41</volume>
          , p.
          <fpage>11</fpage>
          -
          <lpage>19</lpage>
          , (
          <year>2012</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Batista</surname>
            ,
            <given-names>M. C. M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Salgado</surname>
            ,
            <given-names>A. C.</given-names>
          </string-name>
          :
          <article-title>Information Quality Measurement in Data Integration Schemas</article-title>
          .
          <source>In: 5th QDB</source>
          , p.
          <fpage>61</fpage>
          -
          <lpage>72</lpage>
          , Vienna (
          <year>2007</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Ismail</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Quafafou</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Durand</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Nachouki</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hajjar</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Queries Mining for Efficient Routing in P2P Communities</article-title>
          . In: IJDMS, Vol.
          <volume>2</volume>
          , No.
          <volume>1</volume>
          , (
          <year>2010</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Souza</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Arruda</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Salgado</surname>
            ,
            <given-names>A. C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tedesco</surname>
            ,
            <given-names>P. C. A. R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kedad</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          :
          <article-title>Using Semantics to Enhance Query Reformulation in Dynamic Environments</article-title>
          .
          <source>In: 13th ADBIS</source>
          , p.
          <fpage>78</fpage>
          -
          <lpage>92</lpage>
          , Riga, Latvia (
          <year>2009</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Delveroudis</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lekeas</surname>
            ,
            <given-names>P. V.</given-names>
          </string-name>
          :
          <article-title>Managing Semantic Loss during Query Reformulation in PDMS</article-title>
          . In: SWOD IEEE, p.
          <fpage>51</fpage>
          -
          <lpage>53</lpage>
          . (
          <year>2007</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Delveroudis</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lekeas</surname>
            ,
            <given-names>P. V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Souliou</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          :
          <article-title>On Estimating Semantic Loss in Peer Data Management Systems</article-title>
          .
          <source>In: AP2PS IEEE</source>
          , p.
          <fpage>51</fpage>
          -
          <lpage>53</lpage>
          . NTUA, Greece (
          <year>2009</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Karnstedt</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sattler</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Haß</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hauswirth</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sapkota</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Schmidt</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          :
          <article-title>Approximating Query Completeness by Predicting the Number of Answers in DHT-based Web Applications</article-title>
          .
          <source>In: 10th ACM WIDM</source>
          , p.
          <fpage>71</fpage>
          -
          <lpage>78</lpage>
          , (
          <year>2008</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Naumann</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Freytag</surname>
            ,
            <given-names>J.C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Leser</surname>
            ,
            <given-names>U.</given-names>
          </string-name>
          :
          <article-title>Completeness of Integrated Information Sources</article-title>
          .
          <source>In: Inform. Systems</source>
          <volume>29</volume>
          (
          <issue>7</issue>
          ), p.
          <fpage>583</fpage>
          -
          <lpage>615</lpage>
          , (
          <year>2004</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Tatarinov</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Halevy</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Efficient Query Reformulation in Peer-Data Management Systems</article-title>
          . In: ACM SIGMOD, p.
          <fpage>539</fpage>
          -
          <lpage>550</lpage>
          , (
          <year>2004</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>Bonifati</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Summa</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pacitti</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Draidi</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          :
          <article-title>Semantic Query Reformulation in Social PDMS</article-title>
          .
          <source>In: CoRR abs</source>
          , (
          <year>2011</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <given-names>A.</given-names>
            <surname>Bonifati</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.Q.</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Ho</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.V.S.</given-names>
            <surname>Lakshmanan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Pottinger</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Y.</given-names>
            <surname>Chung</surname>
          </string-name>
          .
          <article-title>Schema mapping and query translation in heterogeneous p2p xml databases</article-title>
          .
          <source>VLDB J</source>
          .,
          <volume>19</volume>
          (
          <issue>2</issue>
          ):
          <fpage>231</fpage>
          -
          <lpage>256</lpage>
          , (
          <year>2010</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <surname>Hose</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Roth</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zeitz</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sattler</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Naumann</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          :
          <article-title>A Research Agenda for Query Processing in Large-Scale Peer Data Management Systems</article-title>
          .
          <source>In: Information Systems</source>
          , (
          <year>2008</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          1
          <article-title>The SPEED Project</article-title>
          (http://www.cin.ufpe.br/~speed/)
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>