<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Incorporating Completeness Quality Support in Internet Query Systems</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Sandra</forename><forename type="middle">De F</forename><surname>Mendes Sampaio</surname></persName>
							<affiliation key="aff0">
								<orgName type="department">School of Informatics</orgName>
								<orgName type="institution">University of Manchester</orgName>
								<address>
									<postCode>M60 1QD</postCode>
									<settlement>Manchester</settlement>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Pedro</forename><forename type="middle">R</forename><surname>Falcone Sampaio</surname></persName>
							<affiliation key="aff0">
								<orgName type="department">School of Informatics</orgName>
								<orgName type="institution">University of Manchester</orgName>
								<address>
									<postCode>M60 1QD</postCode>
									<settlement>Manchester</settlement>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">Incorporating Completeness Quality Support in Internet Query Systems</title>
					</analytic>
					<monogr>
						<imprint>
							<date/>
						</imprint>
					</monogr>
					<idno type="MD5">8FA098C0D16DE497BB3EC0858FEA0366</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2023-03-25T06:56+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<abstract/>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">Introduction</head><p>There has been exponential growth in the availability of data on the web and in the usage of systems and tools for querying and retrieving web data. Despite the considerable advances in search engines and other internet technologies for dynamically combining, integrating and collating web data, supporting a DBMS-like data management approach across multiple web data sources is still an elusive goal. To address this gap, internet query systems (IQS) <ref type="bibr" target="#b0">[1]</ref> are being developed to enable DBMS-like query processing and data management over multiple web data sources, shielding the user from complexities such as information heterogeneity, unpredictability of data source response rates, and distributed query execution.</p><p>The comprehensive query processing approach supported by IQS allows users to query a global information system without being aware of the site structure, query languages, and semantics of the data repositories that store the relevant data for a given query <ref type="bibr" target="#b0">[1]</ref>. Despite the significant amount of work on data integration and distributed query processing capabilities, internet query systems still lack adequate mechanisms for controlling the quality of the data they retrieve and process. Typical data quality dimensions <ref type="bibr" target="#b1">[2]</ref> that need to be addressed when supporting quality aware query processing over multiple web data sources are accuracy, completeness and timeliness.</p><p>We are currently investigating how an internet query system can be extended to support a dynamic data quality aware query processing framework. In particular, we are developing Completeness extensions for the Niagara Internet Query System.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2">Measuring Model and Data Completeness of XML Data</head><p>Completeness is a context-dependent data quality dimension that refers to "the extent to which data are of sufficient breadth, depth and scope for the task at hand" <ref type="bibr" target="#b3">[4]</ref>. In the context of a database model, two types of completeness dimensions are considered: model completeness and data completeness. Model completeness measures how appropriate the schema of the database is for a particular application; data completeness refers to the measurable errors of omission observed between the database and its schema, checking, for example, whether a database contains all entities/attributes specified in the schema. Completeness issues arising in database applications may have several causes, for example, discrepancies between the intent for information querying and the collected data, partial capture of data semantics during data modeling, and the loss of data resulting from data exchange. Potential approaches to address completeness issues include removing entities with missing values from the database, replacing missing values with default values, and completing missing values with data from other sources. Irrespective of the approach taken to deal with poor data completeness, it is crucial that database users formulating queries across multiple data sources are able to judge whether a particular query result is "fit" for its purpose, by measuring the level of completeness of the result.</p></div>
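<div xmlns="http://www.tei-c.org/ns/1.0"><p>As a minimal illustration of how such a level of completeness might be quantified, the sketch below scores completeness as one minus the ratio of missing items to expected items. This ratio is an assumption made for illustration only, since no formula is given here, and the function name, parameter names and counts are hypothetical.</p><p>
```python
def completeness_ratio(missing: int, expected: int) -> float:
    """Assumed score: 1 - missing/expected (illustrative, not taken from the paper)."""
    if expected == 0:
        # Nothing was expected, so nothing can be missing.
        return 1.0
    if missing not in range(expected + 1):
        raise ValueError("missing must lie between 0 and expected")
    return 1.0 - missing / expected


# Hypothetical counts for a single XML document.
model_completeness = completeness_ratio(missing=2, expected=10)  # elements/attributes
data_completeness = completeness_ratio(missing=5, expected=50)   # instance values
print(model_completeness, data_completeness)
```
</p><p>Under this assumed measure, the same function serves both dimensions: schema-level counts give a model completeness score and instance-level counts give a data completeness score.</p></div>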
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3">Tagging Completeness Information to Data</head><p>To enable quality aware query processing, data sources should provide quality information relating to each stored XML document. Measuring model completeness and data completeness for a document requires, for example, the number of missing elements/attributes in the document, the expected total of elements/attributes, the number of missing instance values, and the expected total of instance values. The information needs to be tagged and delivered to the Internet Query System mediator so that quality aware query processing can take place. Figure <ref type="figure">3.1</ref> illustrates the mechanism for tagging quality information to XML data. We have adapted the mechanism proposed in <ref type="bibr" target="#b4">[5]</ref> for tagging data quality information on relational data.</p></div>
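<div xmlns="http://www.tei-c.org/ns/1.0"><p>The tagging step can be pictured with a small sketch that attaches the four counts to a document before it is sent to the mediator. The element name iqr and all attribute names below are hypothetical choices for illustration; the actual tag layout used in the Niagara extension is not specified in this section.</p><p>
```python
import xml.etree.ElementTree as ET


def tag_completeness(doc, missing_elems, expected_elems, missing_values, expected_values):
    """Append a hypothetical quality tag holding the four completeness counts."""
    tag = ET.SubElement(doc, "iqr")  # tag name and attribute names are assumptions
    tag.set("missingElements", str(missing_elems))
    tag.set("expectedElements", str(expected_elems))
    tag.set("missingValues", str(missing_values))
    tag.set("expectedValues", str(expected_values))
    return doc


# A toy document as a data source might store it.
book = ET.Element("book")
ET.SubElement(book, "title").text = "Example"
tag_completeness(book, 1, 4, 2, 12)

# Serialised form that would be delivered to the mediator.
tagged_xml = ET.tostring(book, encoding="unicode")
print(tagged_xml)
```
</p><p>Keeping the counts inside the document itself, rather than in a separate catalogue, means the quality information travels with the data stream; this is one possible design and is shown only as a sketch.</p></div>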
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4">Quality Aware Algebraic Query Processing</head><p>The quality aware query processing implementation framework described in this paper is being developed as an extension to the Niagara IQS algebraic operators. When a query is submitted to Niagara as an XML-based query expression, it is transformed into two sub-queries, a search engine query and a query engine query.</p><p>While the former is used by the search engine to select the data sources that are relevant to answer the query, the latter is optimized and ultimately mapped into a quality aware algebraic query execution plan that incorporates algebraic operators addressing completeness information. Following data source selection, the process of fetching data takes place, and streams of data start flowing from the data sources to the site of the Internet Query System for query execution. This process is illustrated in Figure <ref type="figure">4.1</ref>. The Completeness Algebra whose operators compose a query execution plan is an XML algebra extended with an operator that encapsulates the capability of measuring completeness quality of XML data based on completeness factors tagged on the data. The algebraic query processing framework adopted in our implementation extends algebraic quality operators developed for relational systems <ref type="bibr" target="#b4">[5]</ref> to devise an XML-based algebra for the Niagara IQS that can take into account completeness quality information during query execution. The Completeness Algebra is similar to an XML-algebra, but it has an additional operator, the Completeness operator, which encapsulates functions for measuring, inserting and propagating completeness information in XML data, provided the data has completeness factors associated with it (IQR tags).</p><p>[Figure 4.1 diagram labels: Query; Query Processing; Query Optimisation; Query Execution; Query Engine Query; Search Engine Query; Data Source Selection; Data Fetching; Query Results; Data.]</p></div>
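<div xmlns="http://www.tei-c.org/ns/1.0"><p>To make the role of a Completeness operator in an execution plan more concrete, the following sketch shows a streaming operator that measures a score for each fetched document and propagates it with the data. It is not the Niagara implementation: the TaggedDoc shape, the score (one minus missing over expected) and the threshold filter are all assumptions introduced for illustration.</p><p>
```python
from typing import Iterable, Iterator, NamedTuple, Tuple


class TaggedDoc(NamedTuple):
    """A fetched document with its tagged completeness factors (hypothetical shape)."""
    name: str
    missing: int
    expected: int


def completeness_operator(
    stream: Iterable[TaggedDoc], threshold: float
) -> Iterator[Tuple[str, float]]:
    """Measure each document's completeness and pass on those meeting the threshold."""
    for doc in stream:
        # Assumed score: 1 - missing/expected; an empty expectation counts as complete.
        score = 1.0 if doc.expected == 0 else 1.0 - doc.missing / doc.expected
        if score >= threshold:
            yield doc.name, score  # the score travels with the data


docs = [TaggedDoc("a.xml", 1, 10), TaggedDoc("b.xml", 6, 10)]
results = list(completeness_operator(docs, threshold=0.5))
print(results)
```
</p><p>Writing the operator as a generator mirrors the streaming style of query execution described above, since each document is scored as it arrives rather than after the whole result has been collected.</p></div>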
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5">Related Work</head><p>In <ref type="bibr" target="#b13">[9]</ref> an approach for data quality management in Cooperative Information Systems is described. The architecture has as its main component a Data Quality Broker, which performs data requests on all cooperating systems on behalf of a requesting system. The request is a query expressed in the XQuery language along with a set of quality requirements that the desired data have to satisfy. A typical feature of cooperative query systems is the high degree of data replication, with different copies of the same data received as responses. The responses are reconciled and the best results (based on quality thresholds) are selected and delivered to users, who can choose to discard output data and adopt higher quality alternatives. All cooperating systems export their application data and quality data thresholds, so that quality certification and diffusion are ensured by the system. The system, however, does not adopt an algebraic query processing framework and is not built on top of a mainstream IQS.</p><p>In <ref type="bibr" target="#b11">[8]</ref>, data quality is incorporated into schema integration by answering a global query using only queries that are classified as high quality and executable by a subset of the data sources. This is done by assigning quality scores to queries based on previous knowledge about the data to be queried, considering quality dimensions such as completeness, timeliness and accuracy. The queries are ranked according to their scores and executed from the highest quality plan to the lowest quality plan until a stopping criterion is reached. The described approach, however, does not use XML as the canonical data model and does not address physical algebraic query plan implementation issues.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6">Conclusions and Future Work</head><p>With the ubiquitous growth, availability, and usage of data on the web, addressing data quality requirements in connection with web queries is emerging as a key priority for database research <ref type="bibr" target="#b2">[3]</ref>. There are two established approaches for addressing data quality issues relating to web data: data warehouse-based, where relevant data is reconciled, cleansed and warehoused prior to querying; and mediator-based, where quality metrics and thresholds relating to cooperative web data sources are evaluated "on the fly" at query processing and execution time. In this paper we illustrate the query processing extensions being engineered into the Niagara internet query system to support mediator-based quality aware query processing for the completeness data quality dimension. We are also addressing the timeliness dimension <ref type="bibr" target="#b7">[6]</ref> and extending SQL with data quality constructs to express data quality requirements <ref type="bibr" target="#b8">[7]</ref>. The data quality aware query processing extensions encompass metadata support, an XML-based data quality measurement method, algebraic query processing operators, and query plan structures of a query processing framework aimed at helping users to identify, assess, and filter out data whose completeness is too low for the intended use. In future work, we intend to incorporate accuracy data quality support into the framework and benchmark the quality/cost query optimiser in connection with a health care application.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>Fig 3.1</head><label>3.1</label><figDesc>Fig 3.1 XML Data Tagging Mechanism.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_1"><head>Fig 4.1</head><label>4.1</label><figDesc>Fig 4.1 Query Processing and Data Search in Niagara.</figDesc></figure>
		</body>
		<back>
			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">The Niagara Internet Query System</title>
		<author>
			<persName><forename type="first">J</forename><surname>Naughton</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>DeWitt</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Maier</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">IEEE Data Eng. Bull</title>
		<imprint>
			<biblScope unit="volume">24</biblScope>
			<biblScope unit="issue">2</biblScope>
			<biblScope unit="page" from="27" to="33" />
			<date type="published" when="2001">2001</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<monogr>
		<author>
			<persName><forename type="first">J</forename><surname>Olson</surname></persName>
		</author>
		<title level="m">Data Quality: the Accuracy Dimension</title>
				<imprint>
			<publisher>Morgan Kaufmann</publisher>
			<date type="published" when="2003">2003</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">Data Quality on the Web</title>
		<author>
			<persName><forename type="first">M</forename><surname>Gertz</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Ozsu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Saake</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Sattler</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="s">Dagstuhl Seminar</title>
		<imprint>
			<date type="published" when="2003">2003</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">The Inter-Database Instance Identification Problem in Integrating Autonomous Systems</title>
		<author>
			<persName><forename type="first">Richard</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Stuart</forename><forename type="middle">E</forename><surname>Madnick</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of ICDE Conference</title>
				<meeting>ICDE Conference</meeting>
		<imprint>
			<date type="published" when="1989">1989</date>
			<biblScope unit="page" from="46" to="55" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">Toward Quality Data: An Attribute-Based Approach</title>
		<author>
			<persName><forename type="first">R</forename><forename type="middle">Y</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">P</forename><surname>Reddy</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><forename type="middle">B</forename><surname>Kon</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Decision Support Systems</title>
		<imprint>
			<biblScope unit="volume">13</biblScope>
			<biblScope unit="page" from="349" to="372" />
			<date type="published" when="1995">1995</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<analytic>
		<title level="a" type="main">Incorporating the Timeliness Quality Dimension in Internet Query Systems</title>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">F M</forename><surname>Sampaio</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Dong</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><forename type="middle">R F</forename><surname>Sampaio</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">WISE 2005 Workshops</title>
				<imprint>
			<date type="published" when="2005">2005</date>
			<biblScope unit="volume">3807</biblScope>
			<biblScope unit="page" from="53" to="62" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<analytic>
		<title level="a" type="main">Expressing and Processing Timeliness Quality Aware Queries: The DQ2L Approach</title>
		<author>
			<persName><forename type="first">C</forename><surname>Dong</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">F M</forename><surname>Sampaio</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><forename type="middle">R F</forename><surname>Sampaio</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">International Workshop on Quality of Information Systems, ER 2006 Workshops</title>
		<imprint>
			<date type="published" when="2006">2006</date>
		</imprint>
	</monogr>
	<note>To appear</note>
</biblStruct>

<biblStruct xml:id="b11">
	<analytic>
		<title level="a" type="main">Quality-driven Integration of Heterogeneous Information Systems</title>
		<author>
			<persName><forename type="first">F</forename><surname>Naumann</surname></persName>
		</author>
		<author>
			<persName><forename type="first">U</forename><surname>Leser</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Freytag</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 25th VLDB Conference</title>
		<meeting>the 25th VLDB Conference<address><addrLine>Scotland</addrLine></address></meeting>
		<imprint>
			<date type="published" when="1999">1999</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b13">
	<analytic>
		<title level="a" type="main">The DaQuinCIS Broker: Querying Data and Their Quality in Cooperative Information Systems</title>
		<author>
			<persName><forename type="first">M</forename><surname>Mecella</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Scannapieco</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Virgillito</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Baldoni</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Catarci</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Batini</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">LNCS</title>
		<imprint>
			<biblScope unit="volume">2800</biblScope>
			<biblScope unit="page" from="208" to="232" />
			<date type="published" when="2003">2003</date>
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
