<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Incorporating Completeness Quality Support in Internet Query Systems</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Sandra de F. Mendes Sampaio</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Pedro R. Falcone Sampaio</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>School of Informatics, University of Manchester</institution>
          ,
          <addr-line>Manchester M60 1QD</addr-line>
        </aff>
      </contrib-group>
      <fpage>17</fpage>
      <lpage>20</lpage>
      <abstract>
        <p>Measuring Model and Data Completeness of XML Data</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1 Introduction</title>
      <p>
        There has been an exponential growth in the availability of data on the web and in
the usage of systems and tools for querying and retrieving web data. Despite the
considerable advances in search engines and other internet technologies for
dynamically combining, integrating and collating web data, supporting a DBMS-like
data management approach across multiple web data sources is still an elusive goal.
To address this gap, internet query systems (IQS) [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] are being developed to enable
DBMS-like query processing and data management over multiple web data sources,
shielding the user from complexities such as information heterogeneity,
unpredictability of data source response rates, and distributed query execution.
      </p>
      <p>
        The comprehensive query processing approach supported by IQS allows users to
query a global information system without being aware of the site structure, query
languages, and semantics of the data repositories that store the relevant data for a
given query [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. Despite the significant amount of work on data integration and
distributed query processing capabilities, internet query systems
still lack adequate data quality control mechanisms for managing the quality
of the data they retrieve and process. Typical
data quality dimensions [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] that need to be addressed when supporting quality-aware
query processing over multiple web data sources are accuracy, completeness
and timeliness.
      </p>
      <p>
        We are currently investigating how an internet query system can be extended to
support a dynamic, data-quality-aware query processing framework. In particular, we
are developing completeness extensions for the Niagara Internet Query System.
Completeness is a context-dependent data quality dimension that refers to “the extent
to which data are of sufficient breadth, depth and scope for the task at hand” [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. In
the context of a database model, two types of completeness are
considered: model completeness and data completeness. Model completeness
measures how appropriate the schema of the database is for a particular
application; data completeness refers to the measurable errors of omission observed
between the database and its schema, checking, for example, whether a database contains all
entities/attributes specified in the schema. Completeness issues arising in database
applications may have several causes, for example discrepancies between the intent
for information querying and the collected data, partial capture of data semantics
during data modeling, and the loss of data resulting from data exchange. Potential
approaches to address completeness issues include removing entities with missing
values from the database, replacing missing values with default values, and
completing missing values with data from other sources. Irrespective of the approach
taken to deal with poor data completeness, it is crucial that database users formulating
queries across multiple data sources are able to judge whether a particular query result is
“fit” for its purpose, by measuring the level of completeness of the result.
      </p>
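      <p>The three approaches to missing values named above can be sketched as follows. This is a minimal illustration, assuming records are held as Python dictionaries in which a missing value is represented by None; the function names and the sample car records are hypothetical and are not part of Niagara.</p>
      <preformat><![CDATA[
```python
from typing import Optional

# A record maps field names to values; None marks a missing value (assumption).
Record = dict[str, Optional[str]]


def drop_incomplete(records: list[Record]) -> list[Record]:
    """Approach 1: remove entities that have at least one missing value."""
    return [r for r in records if all(v is not None for v in r.values())]


def fill_defaults(records: list[Record], defaults: Record) -> list[Record]:
    """Approach 2: replace each missing value with a default, if one is given."""
    return [
        {k: (defaults.get(k) if v is None else v) for k, v in r.items()}
        for r in records
    ]


def fill_from_source(
    records: list[Record], other: dict[str, Record], key: str
) -> list[Record]:
    """Approach 3: complete missing values from a second source keyed on `key`."""
    completed = []
    for r in records:
        donor = other.get(r[key], {})
        completed.append(
            {k: (donor.get(k) if v is None else v) for k, v in r.items()}
        )
    return completed


# Hypothetical sample data: the second car has no price.
cars = [
    {"model": "A1", "price": "9000"},
    {"model": "B2", "price": None},
]
```
]]></preformat>
      <p>Each function returns a new list and leaves its input untouched, so the completeness of the original data can still be measured after a repair has been applied.</p>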
    </sec>
    <sec id="sec-2">
      <title>3 Tagging Completeness Information to Data</title>
      <p>
        To enable quality-aware query processing, data sources should provide quality
information relating to each stored XML document, such as the number of missing
elements/attributes in the document and the expected total of elements/attributes, as well
as the number of missing instance values and the expected total of instance values.
These counts are required to measure model completeness and data completeness for the document.
The information needs to be tagged and delivered to the Internet Query System
mediator so that quality assessment can take place during query processing. Figure 3.1
illustrates the mechanism for tagging quality information to XML data. We have
adapted the mechanism proposed in [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] for tagging data quality information on
relational data.
      </p>
      <preformat>&lt;!-- Application data: dealers and the cars they offer --&gt;
&lt;!ELEMENT carDealerInformation (dealer*, dataQuality)&gt;
&lt;!ELEMENT dealer (name, car*)&gt;
&lt;!ATTLIST dealer id ID #REQUIRED&gt;
&lt;!ELEMENT name (#PCDATA)&gt;
&lt;!ELEMENT car (model, price)&gt;
&lt;!ELEMENT model (#PCDATA)&gt;
&lt;!ELEMENT price (#PCDATA)&gt;
&lt;!-- Quality data: expected totals and missing counts for elements and values --&gt;
&lt;!ELEMENT dataQuality (completeness)&gt;
&lt;!ELEMENT completeness (numberElements, missingElements, numberValues, missingValues)&gt;
&lt;!ELEMENT numberElements (#PCDATA)&gt;
&lt;!ELEMENT missingElements (numberMissingElem, elem*)&gt;
&lt;!ELEMENT numberValues (#PCDATA)&gt;
&lt;!ELEMENT missingValues (numberMissingVal, elem*)&gt;
&lt;!ELEMENT numberMissingElem (#PCDATA)&gt;
&lt;!ELEMENT numberMissingVal (#PCDATA)&gt;
&lt;!-- elem reuses the name element declared above; an element type may be declared only once --&gt;
&lt;!ELEMENT elem (name, number)&gt;
&lt;!ELEMENT number (#PCDATA)&gt;</preformat>
      <p>Fig 3.1 XML Data Tagging Mechanism.</p>
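      <p>To make the tagging mechanism concrete, the sketch below reads the completeness tags of a document that follows the DTD in Fig 3.1 and derives two scores from them. The text does not state how the counts are combined, so the ratio (expected - missing) / expected is an assumption made here for both measures; the sample document, its counts, and the function names are illustrative only.</p>
      <preformat><![CDATA[
```python
import xml.etree.ElementTree as ET

# Hypothetical instance of the DTD in Fig 3.1: one dealer, two cars,
# the second car has an empty price (1 of 5 expected values is missing).
SAMPLE = """<carDealerInformation>
  <dealer id="d1">
    <name>Dealer One</name>
    <car><model>A1</model><price>9000</price></car>
    <car><model>B2</model><price></price></car>
  </dealer>
  <dataQuality>
    <completeness>
      <numberElements>8</numberElements>
      <missingElements>
        <numberMissingElem>2</numberMissingElem>
        <elem><name>colour</name><number>1</number></elem>
        <elem><name>year</name><number>1</number></elem>
      </missingElements>
      <numberValues>5</numberValues>
      <missingValues>
        <numberMissingVal>1</numberMissingVal>
        <elem><name>price</name><number>1</number></elem>
      </missingValues>
    </completeness>
  </dataQuality>
</carDealerInformation>"""


def ratio(total: int, missing: int) -> float:
    """Fraction of expected items that are present (assumed metric)."""
    if total <= 0:
        raise ValueError("expected total must be positive")
    return (total - missing) / total


def completeness_scores(xml_text: str) -> dict[str, float]:
    """Read the completeness tags and return model and data completeness."""
    root = ET.fromstring(xml_text)
    c = root.find("dataQuality/completeness")
    if c is None:
        raise ValueError("document carries no completeness tags")
    return {
        "model": ratio(
            int(c.findtext("numberElements")),
            int(c.findtext("missingElements/numberMissingElem")),
        ),
        "data": ratio(
            int(c.findtext("numberValues")),
            int(c.findtext("missingValues/numberMissingVal")),
        ),
    }
```
]]></preformat>
      <p>For the sample document, completeness_scores(SAMPLE) yields a model completeness of 0.75 and a data completeness of 0.8 under the assumed ratio.</p>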
    </sec>
    <sec id="sec-3">
      <title>4 Quality Aware Algebraic Query Processing</title>
      <p>The quality-aware query processing implementation framework described in this
paper is being developed as an extension to the Niagara IQS algebraic operators.
When a query is submitted to Niagara as an XML-based query expression, it is
transformed into two sub-queries, a search engine query and a query engine query.
While the former is used by the search engine to select the data sources that are
relevant to answering the query, the latter is optimized and ultimately mapped into a
quality-aware algebraic query execution plan that incorporates algebraic operators
addressing completeness information. Following data source selection, the process of
fetching data takes place, and streams of data start flowing from the data sources to
the site of the Internet Query System for query execution. This process is illustrated in
Figure 4.1.</p>
      <p>[Figure 4.1 diagram labels: Query; Query Engine Query; Search Engine Query; Data Source Selection; Query Processing; Query Optimisation; Query Execution; Data Fetching; Data; Query Results.]</p>
      <p>Fig 4.1 Query Processing and Data Search in Niagara.</p>
      <p>
        The Completeness Algebra, whose operators compose a query execution plan, is an
XML algebra extended with an operator that encapsulates the capability of measuring the
completeness quality of XML data based on completeness factors tagged on the data.
The algebraic query processing framework adopted in our implementation extends
algebraic quality operators developed for relational systems [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] to devise an
XML-based algebra for the Niagara IQS that can take completeness quality
information into account during query execution. The additional operator, the
Completeness operator, encapsulates functions for measuring, inserting and
propagating completeness information in XML data, provided the data has
completeness factors associated with it (IQR tags).
      </p>
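      <p>The role of a Completeness operator in a plan can be sketched as a stream operator that measures a score from the tagged factors, attaches it to each document, and propagates factors when results are combined. This is a hedged sketch, not Niagara's operator interface: the Factors class, the additive merge rule, the score formula and the threshold parameter are all assumptions introduced for illustration.</p>
      <preformat><![CDATA[
```python
from dataclasses import dataclass
from typing import Iterable, Iterator


@dataclass(frozen=True)
class Factors:
    """Completeness factors tagged on one XML document (hypothetical container)."""

    expected_values: int
    missing_values: int

    def score(self) -> float:
        # Assumed metric: share of expected values that are present.
        return (self.expected_values - self.missing_values) / self.expected_values

    def merge(self, other: "Factors") -> "Factors":
        # Assumed propagation rule: counts add up when two results are combined.
        return Factors(
            self.expected_values + other.expected_values,
            self.missing_values + other.missing_values,
        )


def completeness(
    stream: Iterable[tuple[str, Factors]], threshold: float
) -> Iterator[tuple[str, float]]:
    """Annotate each document with its score; pass on those meeting the threshold."""
    for doc_id, factors in stream:
        s = factors.score()
        if s >= threshold:
            yield doc_id, s
```
]]></preformat>
      <p>Writing the operator as a generator lets it consume documents as they arrive from the data sources, which matches the streaming style of execution described for the query engine.</p>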
    </sec>
    <sec id="sec-4">
      <title>5 Related Work</title>
      <p>
        In [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] an approach for data quality management in Cooperative Information Systems
is described. The architecture has as its main component a Data Quality Broker,
which performs data requests on all cooperating systems on behalf of a requesting
system. The request is a query expressed in the XQuery language along with a set of
quality requirements that the desired data have to satisfy. A typical feature of
cooperative query systems is the high degree of data replication, with different copies
of the same data received as responses. The responses are reconciled and the best
results (based on quality thresholds) are selected and delivered to users, who can
choose to discard output data and adopt higher quality alternatives. All cooperating
systems export their application data and quality data thresholds, so that quality
certification and diffusion are ensured by the system. The system, however, does not
adopt an algebraic query processing framework and is not built on top of a
mainstream IQS. In [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], data quality is incorporated into schema integration by
answering a global query using only queries that are classified as high quality and
executable by a subset of the data sources. This is done by assigning quality scores to
queries based on previous knowledge about the data to be queried, considering quality
dimensions such as completeness, timeliness and accuracy. The queries are ranked
according to their scores and executed from the highest quality plan to the lowest
quality plan until a stopping criterion is reached. The described approach, however, does
not use XML as the canonical data model and does not address physical algebraic
query plan implementation issues.
      </p>
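      <p>The idea of executing plans in descending order of quality score until a stopping criterion is met can be illustrated generically as below. This is not the algorithm of the cited work: the plan names, the scores, the execute callback, and the choice of a fixed plan budget as the stopping criterion are all assumptions made for the sketch.</p>
      <preformat><![CDATA[
```python
from typing import Callable


def run_ranked(
    plans: dict[str, float],
    execute: Callable[[str], list[str]],
    max_plans: int,
) -> list[str]:
    """Run plans from highest to lowest score, stopping after max_plans plans."""
    results: list[str] = []
    # Rank plan names by their quality score, best first.
    ranked = sorted(plans, key=plans.get, reverse=True)
    for name in ranked[:max_plans]:
        results.extend(execute(name))
    return results
```
]]></preformat>
      <p>Other stopping criteria, such as a minimum number of result rows or a cost limit, would replace the slice over the ranked list with an explicit check inside the loop.</p>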
    </sec>
    <sec id="sec-5">
      <title>6 Conclusions and Future Work</title>
      <p>
        With the ubiquitous growth, availability, and usage of data on the web, addressing
data quality requirements in connection with web queries is emerging as a key priority
for database research [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. There are two established approaches for addressing data
quality issues relating to web data: data warehouse-based, where relevant data is
reconciled, cleansed and warehoused prior to querying; and mediator-based, where
quality metrics and thresholds relating to cooperative web data sources are evaluated
“on the fly” at query processing and execution time. In this paper we illustrate the
query processing extensions being engineered into the Niagara internet query system
to support mediator-based quality-aware query processing for the completeness data
quality dimension. We are also addressing the timeliness dimension [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] and extending
SQL with data quality constructs to express data quality requirements [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. The data
quality-aware query processing extensions encompass metadata support, an
XML-based data quality measurement method, algebraic query processing operators, and
query plan structures of a query processing framework aimed at helping users to
identify, assess, and filter out data regarded as having low completeness quality for
the intended use. In future work, we intend to incorporate accuracy data quality
support into the framework and benchmark the quality/cost query optimiser in
connection with a health care application.
      </p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>J.</given-names>
            <surname>Naughton</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>DeWitt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Maier</surname>
          </string-name>
          , et al:
          <article-title>The Niagara Internet Query System</article-title>
          .
          <source>IEEE Data Eng. Bull</source>
          .
          <volume>24</volume>
          (
          <issue>2</issue>
          ):
          <fpage>27</fpage>
          -
          <lpage>33</lpage>
          (
          <year>2001</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>J.</given-names>
            <surname>Olson</surname>
          </string-name>
          ,
          <article-title>Data Quality: the Accuracy Dimension</article-title>
          , Morgan Kaufmann,
          <year>2003</year>
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>M.</given-names>
            <surname>Gertz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Ozsu</surname>
          </string-name>
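          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Saake</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Sattler</surname>
          </string-name>
          :
          <article-title>Data Quality on the Web</article-title>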
          , Dagstuhl Seminar, Germany,
          <year>2003</year>
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Richard</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Stuart E.</given-names>
            <surname>Madnick</surname>
          </string-name>
          :
          <article-title>The Inter-Database Instance Identification Problem in Integrating Autonomous Systems</article-title>
          ,
          <source>Proceedings of ICDE Conference</source>
          ,
          <fpage>46</fpage>
          -
          <lpage>55</lpage>
          , (
          <year>1989</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
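          <string-name>
            <given-names>R. Y.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. P.</given-names>
            <surname>Reddy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H. B.</given-names>
            <surname>Kon</surname>
          </string-name>
          :
          <article-title>Toward Quality Data: An Attribute-Based Approach</article-title>
          ,
          <source>Decision Support Systems</source>
          <volume>13</volume>
          ,
          <fpage>349</fpage>
          -
          <lpage>372</lpage>
          (
          <year>1995</year>
          )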
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
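          <string-name>
            <given-names>S. F. M.</given-names>
            <surname>Sampaio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Dong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. R. F.</given-names>
            <surname>Sampaio</surname>
          </string-name>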
          ,
          <article-title>Incorporating the Timeliness Quality Dimension in Internet Query Systems</article-title>
          ,
          <source>WISE 2005 Workshops, LNCS 3807</source>
          , pp.
          <fpage>53</fpage>
          -
          <lpage>62</lpage>
          ,
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>C.</given-names>
            <surname>Dong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. F. M.</given-names>
            <surname>Sampaio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. R. F.</given-names>
            <surname>Sampaio</surname>
          </string-name>
          :
          <article-title>Expressing and Processing Timeliness Quality Aware Queries: The DQ2L Approach</article-title>
          , to appear in
          <source>International Workshop on Quality of Information Systems, ER 2006 Workshops, LNCS</source>
          ,
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>F.</given-names>
            <surname>Naumann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>U.</given-names>
            <surname>Leser</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Freytag</surname>
          </string-name>
          :
          <article-title>Quality-driven Integration of Heterogeneous Information Systems</article-title>
          ,
          <source>Proceedings of the 25th VLDB Conference</source>
          , Scotland,
          <year>1999</year>
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>M.</given-names>
            <surname>Mecella</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Scannapieco</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Virgillito</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Baldoni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Catarci</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Batini</surname>
          </string-name>
          :
          <article-title>The DaQuinCIS Broker: Querying Data and Their Quality in Cooperative Information Systems</article-title>
          .
          <source>LNCS 2800</source>
          , pp.
          <fpage>208</fpage>
          -
          <lpage>232</lpage>
          .
          <year>2003</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>