<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">A Query-driven and Incremental Process for Entity Resolution</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Priscilla</forename><surname>Kelly</surname></persName>
						</author>
						<author>
							<persName><forename type="first">M</forename><surname>Viera</surname></persName>
							<affiliation key="aff0">
								<orgName type="department">Center for Informatics</orgName>
								<orgName type="institution">Federal University of Pernambuco</orgName>
								<address>
									<settlement>Recife</settlement>
									<region>Pernambuco</region>
									<country key="BR">Brazil</country>
								</address>
							</affiliation>
							<affiliation key="aff1">
								<orgName type="institution">Federal Rural University of Pernambuco</orgName>
								<address>
									<settlement>Recife</settlement>
									<region>Pernambuco</region>
									<country key="BR">Brazil</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Ana</forename><forename type="middle">Carolina</forename><surname>Salgado</surname></persName>
							<affiliation key="aff0">
								<orgName type="department">Center for Informatics</orgName>
								<orgName type="institution">Federal University of Pernambuco</orgName>
								<address>
									<settlement>Recife</settlement>
									<region>Pernambuco</region>
									<country key="BR">Brazil</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Bernadette</forename><surname>Farias Lóscio</surname></persName>
							<affiliation key="aff0">
								<orgName type="department">Center for Informatics</orgName>
								<orgName type="institution">Federal University of Pernambuco</orgName>
								<address>
									<settlement>Recife</settlement>
									<region>Pernambuco</region>
									<country key="BR">Brazil</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">A Query-driven and Incremental Process for Entity Resolution</title>
					</analytic>
					<monogr>
						<imprint>
							<date/>
						</imprint>
					</monogr>
					<idno type="MD5">D9FFECD9B17D4C5988790BD2424F3DCD</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2023-03-24T04:48+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<abstract/>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">Introduction</head><p>Companies and governmental organizations around the world publish a huge volume of data, which can be stored in multiple data sources. Accessing and analyzing these data requires data integration strategies. The aim of data integration is to combine heterogeneous and autonomous data sources to provide the user with a single view <ref type="bibr" target="#b0">[1]</ref>. An important component of the data integration process is the Entity Resolution (ER) task <ref type="bibr" target="#b1">[2]</ref>. The goal of ER is to identify tuples that refer to the same real-world entity (in this work, tuple is synonymous with instance and record). This problem is known by a variety of names: Record Linkage, Entity Resolution, Object Reference, Reference Linkage, Duplicate Detection, or Deduplication. In this paper, we adopt the term Entity Resolution (ER) <ref type="bibr" target="#b1">[2]</ref>.</p><p>Companies and organizations often have to deal with dynamic data sources holding a large volume of data. In this context, the ER process can be very challenging because most currently available ER techniques process all the entities at once <ref type="bibr" target="#b2">[3]</ref>. This occurs because most of these techniques are based on batch algorithms, which resolve all tuples instead of resolving only those related to a single query <ref type="bibr" target="#b3">[4,</ref><ref type="bibr" target="#b4">5,</ref><ref type="bibr" target="#b5">6]</ref>. Hence, new techniques are needed to support real-time ER for dynamic and large databases.</p><p>For example, consider a set of bibliographic data sources and a query to retrieve all papers by a given author (e.g. "Getoor"). To answer this query, it is not necessary to look at other authors' papers or to perform ER over the whole set of papers. 
In this case, it is better to focus only on the tuples that describe papers by the author specified in the query.</p><p>In this paper, we propose a QUery-Driven and Incremental process for Entity Resolution (QuID). The QuID process considers query results on multiple data sources. It is an incremental process, i.e., for each new query result, QuID reuses the previous ER clusters to answer future queries. In our approach, ER is treated as a clustering problem <ref type="bibr" target="#b6">[7]</ref>, in which each cluster corresponds to the tuples of a single real-world entity. During ER, the results of queries are analyzed, and each tuple of the query result is inserted incrementally into a cluster. Our solution maintains an index over the tuples and performs incremental clustering, resulting in clusters of tuples that refer to the same real-world entity. The rest of the paper is organized as follows. In Section 2 we discuss related work. In Section 3 we formally define the problem and describe the QuID process, and in Section 4 we conclude.</p></div><div xmlns="http://www.tei-c.org/ns/1.0"><head n="2">Related Work</head><p>Bhattacharya and Getoor <ref type="bibr" target="#b3">[4]</ref> proposed a strategy for query-time entity resolution that identifies and resolves only those database references that are the most helpful for processing a given query. Altwaijry et al. <ref type="bibr" target="#b4">[5]</ref> proposed a query-driven approach to ER that exploits the specificity and semantics of the given SQL query. Neither paper proposes reusing previous results of the ER process. The solution proposed by Gruenheid et al. <ref type="bibr" target="#b2">[3]</ref> uses an incremental clustering algorithm to perform ER. Each inserted tuple is compared with the existing clusters and is either placed into an existing cluster or assigned to a new one, and extra information from the data updates is used to fix problems in previous clusters. This solution does not consider query results during the ER task. 
Unlike the approaches mentioned above, the process proposed in this paper is both incremental and query-driven. To the best of our knowledge, no other approach combines these two features.</p></div>
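<div xmlns="http://www.tei-c.org/ns/1.0"><p>The incremental strategy summarized above (compare each inserted tuple with the existing clusters, then either join one or open a new one) can be illustrated with a minimal Python sketch. It is not the algorithm of <ref type="bibr" target="#b2">[3]</ref>: the string similarity from difflib, the threshold of 0.75, and the function names are assumptions chosen only for illustration, and the step that repairs earlier clusters is omitted.</p>
<p>
```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Illustrative string similarity in [0, 1]; an assumption, not the measure of [3]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def insert_incrementally(clusters: list, record: str, threshold: float = 0.75) -> int:
    """Place record in the best-matching cluster, or open a new cluster.

    Returns the position of the cluster that received the record.
    """
    best_pos, best_score = -1, 0.0
    for pos, members in enumerate(clusters):
        # Compare against every member and keep the strongest match.
        score = max(similarity(record, m) for m in members)
        if score > best_score:
            best_pos, best_score = pos, score
    if best_pos >= 0 and best_score >= threshold:
        clusters[best_pos].append(record)
        return best_pos
    clusters.append([record])
    return len(clusters) - 1


# Hypothetical author strings arriving one at a time.
clusters = []
for name in ["Lise Getoor", "L. Getoor", "Indrajit Bhattacharya", "Lise Getoor"]:
    insert_incrementally(clusters, name)
```
</p>
<p>Each call touches only the clusters built so far, so earlier work is kept rather than recomputed when a new tuple arrives.</p></div>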
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3">Problem Statement</head><p>In this section we formally define the problem of query-driven and incremental ER (Section 3.1). We then describe our Query-Driven and Incremental process for Entity Resolution (QuID) (Section 3.2).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1">Problem Definition</head><p>Given a set of tuples, the ER process is essentially a clustering problem, in which each cluster contains tuples that represent a single real-world entity. When ER is performed over multiple data sources, the tuples in a cluster may come from different sources.</p><p>In this paper, our focus is on incremental clustering algorithms. Incremental clustering aims to make the ER process faster than processes that do not use this strategy. The main goal of using query results is to reduce the volume of tuples to be resolved, which in turn reduces the number of comparisons made between tuples.</p><p>Formally, we denote by S = {S_1, S_2, ..., S_n} a set of data sources and by Q = {Q_1, Q_2, ..., Q_m} a set of queries running on S. Each source S_i has a set of entities S_i.E, where E = {E_1, E_2, ..., E_w}. Each entity E_j from S_i.E has a set of tuples S_i.E_j.T = {t_1, t_2, ..., t_n}, where each t_p is an instance of the entity E_j. A tuple t_p is defined as follows.</p><p>Definition 1. Each tuple t_p belonging to S_i.E_j.T is represented by a set of pairs of attributes (A_k) and values (v_k), t_p = {(S_i.E_j.A_1, v_1), (S_i.E_j.A_2, v_2), …, (S_i.E_j.A_n, v_n)}. Each attribute A_k belongs to an entity (E_j) of a data source (S_i) and is denoted by S_i.E_j.A_k. Each tuple t_p has one attribute-value pair that serves as the unique identifier of the tuple (Id).</p><p>A query Q_i may not contain all the attributes necessary (relevant) to decide whether two tuples represent the same real-world entity. Thus, the query is submitted to an expansion process that collects the relevant attributes <ref type="bibr" target="#b7">[8]</ref> that were not specified in the initial query. This expansion generates a query Q_i'. The input of the QuID process is the result of the query Q_i', defined as follows.</p><p>Definition 2. 
A query result, Q_i'.R, is represented by a set of tuples (Definition 1) that belong to an entity E_j. The attributes that describe the tuples of the result Q_i'.R include the set of relevant attributes (A_r), S_i.E_j.A_r, where S_i.E_j.A_r ⊆ S_i.E_j.A.</p><p>For each newly received query result, the ER process reuses the results of previous ER tasks, i.e., previously generated clusters, to answer the query.</p></div>
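<div xmlns="http://www.tei-c.org/ns/1.0"><p>Definitions 1 and 2 can be made concrete with a small Python sketch. It is only one possible encoding under stated assumptions: the class name SourceTuple, the helper functions, the source name S1, the entity name Paper, and the attribute names are all hypothetical, and query expansion is reduced to appending the relevant attributes that the initial query did not request.</p>
<p>
```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SourceTuple:
    """A tuple t_p of entity E_j in source S_i (Definition 1)."""
    source: str      # S_i
    entity: str      # E_j
    tuple_id: str    # value of the identifier attribute (Id)
    values: tuple    # pairs (attribute name A_k, value v_k)

    def qualified(self) -> dict:
        """Return the pairs keyed by the qualified name S_i.E_j.A_k."""
        return {f"{self.source}.{self.entity}.{a}": v for a, v in self.values}


def expand_query(requested: list, relevant: list) -> list:
    """Attribute list of Q_i': the requested attributes of Q_i followed by
    the relevant attributes that the initial query left out."""
    return requested + [a for a in relevant if a not in requested]


def project(t: SourceTuple, attributes: list) -> dict:
    """Restrict a tuple to the given attributes, as in a query result."""
    wanted = set(attributes)
    return {a: v for a, v in t.values if a in wanted}


# Hypothetical tuple from a bibliographic source.
paper = SourceTuple("S1", "Paper", "p17",
                    (("title", "Query-time Entity Resolution"),
                     ("author", "L. Getoor"),
                     ("year", "2007")))
expanded = expand_query(["title"], ["title", "author"])
row = project(paper, expanded)
```
</p>
<p>Here the initial query asks only for the title, while the expanded query also returns the author, which the comparison between tuples is assumed to need.</p></div>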
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2">QuID</head><p>In this section, we describe the proposed process (QuID). Fig. <ref type="figure" target="#fig_0">1</ref> shows the flow of information in QuID. The input of the process is a query result (Q_i'.R'). The process starts with the Indexing step, which aims to reduce the number of comparisons between pairs of tuples. During this step, two indexes are used: the Similarity Index and the Cluster Index. The Similarity Index incrementally maintains the similarity values between each pair of tuples. The Cluster Index incrementally maintains a set of clusters of tuple identifiers. After the Indexing step, the local cluster (L_c) is initialized from the global cluster (G_c), reusing the results of previous ER tasks. After the initialization of L_c, the tuples that were not processed previously are handled in the Tuple Pair Comparison step. In this step, the similarity value of a pair of tuples is retrieved from the Similarity Index when it is already stored there; otherwise, a new similarity value is calculated.</p><p>After the Tuple Pair Comparison step, the next step is Incremental Clustering. The input of this step is a similarity graph, in which nodes are tuples and edges carry the similarity values between tuples. The goal of Incremental Clustering is to insert the tuples not processed before into the local cluster (L_c) and the global cluster (G_c). Finally, the output of QuID is the updated L_c and G_c, which are reused in subsequent ER tasks.</p></div>
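<div xmlns="http://www.tei-c.org/ns/1.0"><p>The flow described in this section can be sketched in Python as follows. This is a deliberately simplified illustration, not the QuID implementation: the class and function names, the difflib similarity, the threshold of 0.75, and the greedy assignment that stands in for clustering over a similarity graph are all assumptions. A plain dictionary from tuple identifier to cluster identifier plays the role of the Cluster Index and of G_c, and new tuples are compared only with tuples of the current query result.</p>
<p>
```python
from difflib import SequenceMatcher


class SimilarityIndex:
    """Stores one similarity value per unordered pair of tuple identifiers."""

    def __init__(self):
        self._scores = {}
        self.computed = 0  # how many similarities were actually calculated

    def get(self, id_a, text_a, id_b, text_b):
        key = frozenset((id_a, id_b))
        if key not in self._scores:
            # Placeholder measure (an assumption); computed once, then reused.
            self._scores[key] = SequenceMatcher(None, text_a, text_b).ratio()
            self.computed += 1
        return self._scores[key]


def resolve(query_result, sim_index, cluster_index, threshold=0.75):
    """Run one simplified pass over a query result {tuple id: text}.

    cluster_index maps tuple id -> cluster id (a stand-in for G_c) and is
    updated in place. Returns L_c as {cluster id: set of tuple ids}.
    """
    # Initialise L_c from G_c: tuples resolved by earlier queries keep
    # their cluster and need no further comparisons.
    local = {}
    for tid in query_result:
        if tid in cluster_index:
            local.setdefault(cluster_index[tid], set()).add(tid)

    # Tuple Pair Comparison and Incremental Clustering for the new tuples.
    new_ids = [tid for tid in query_result if tid not in cluster_index]
    for tid in new_ids:
        target = None
        for cid, members in local.items():
            if any(sim_index.get(tid, query_result[tid], m, query_result[m]) >= threshold
                   for m in members):
                target = cid
                break
        if target is None:
            # No similar cluster in this result: open a new one.
            target = max(cluster_index.values(), default=-1) + 1
        local.setdefault(target, set()).add(tid)
        cluster_index[tid] = target
    return local


sim_index, cluster_index = SimilarityIndex(), {}
result = {"a1": "lise getoor", "a2": "l. getoor", "b1": "indrajit bhattacharya"}
first = resolve(result, sim_index, cluster_index)
second = resolve(result, sim_index, cluster_index)  # same result, nothing recomputed
```
</p>
<p>Submitting the same query result a second time returns the same local clusters without calculating any new similarity value, which is the kind of reuse that the two indexes are meant to provide.</p></div>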
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4">Conclusions</head><p>In this short paper, we introduced and motivated an incremental and query-driven Entity Resolution process, called QuID. We also presented the main components of QuID and some important definitions related to our proposal. At the current stage of our work, we have implemented the two proposed indexes (Cluster Index and Similarity Index). We are currently investigating and evaluating the impact of the incremental clustering algorithm <ref type="bibr" target="#b2">[3,</ref><ref type="bibr" target="#b3">4]</ref> in the context of the proposed process. As future work, we will instantiate and evaluate the complete process.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>Fig. 1.</head><label>1</label><figDesc>Fig. 1. Proposed process (QuID). Our approach uses two types of clusters: global clusters and local clusters. Global Clusters (G_c) are created only once and updated incrementally with each query result Q_i'.R'. A G_c supports the query-driven process by making previous results available for reuse in future queries. A global cluster is defined as follows. Definition 3. A Global Cluster (G_c) is defined by a set of triples, G_c = {(ClusterId, S_i.E_j, S_i.E_j.t_p.Id)}, where ClusterId is the identifier of the cluster, S_i.E_j identifies the data source and the entity of the tuple t_p, and S_i.E_j.t_p.Id is the tuple identifier. Local Clusters (L_c) are created for each query result Q_i'.R'. The output of the ER process is the L_c containing the duplicate tuples detected in the query result. L_c uses previously classified information from the global cluster G_c. We define a local cluster as follows.</figDesc><graphic coords="3,150.52,332.20,295.68,205.92" type="bitmap" /></figure>
		</body>
		<back>
			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">Ontology-based Data Management</title>
		<author>
			<persName><forename type="first">M</forename><surname>Lenzerini</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">International Conference on Information and Knowledge Management (CIKM&apos;11)</title>
				<meeting><address><addrLine>New York, NY, USA</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2011">2011</date>
			<biblScope unit="page" from="5" to="6" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<monogr>
		<title level="m" type="main">Data Matching: Concepts and Techniques for Record Linkage, Entity Resolution, and Duplicate Detection</title>
		<author>
			<persName><forename type="first">P</forename><surname>Christen</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2012">2012</date>
			<publisher>Springer</publisher>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">Incremental Record Linkage</title>
		<author>
			<persName><forename type="first">A</forename><surname>Gruenheid</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><forename type="middle">L</forename><surname>Dong</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Srivastava</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">VLDB&apos;2014</title>
				<meeting><address><addrLine>Hangzhou, China</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2014">2014</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">Query-time Entity Resolution</title>
		<author>
			<persName><forename type="first">I</forename><surname>Bhattacharya</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Getoor</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Journal of Artificial Intelligence Research</title>
		<imprint>
			<date type="published" when="2007">2007</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">Query-Driven Approach to Entity Resolution</title>
		<author>
			<persName><forename type="first">H</forename><surname>Altwaijry</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><forename type="middle">D</forename><surname>Kalashnikov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Mehrotra</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">VLDB 2013</title>
				<meeting><address><addrLine>Italy</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2013">2013</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">Record Matching Over Query Results from Multiple Web Databases</title>
		<author>
			<persName><forename type="first">W</forename><surname>Su</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><forename type="middle">H</forename><surname>Lochovsky</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">IEEE Transactions on Knowledge and Data Engineering</title>
		<imprint>
			<biblScope unit="volume">22</biblScope>
			<biblScope unit="issue">4</biblScope>
			<date type="published" when="2010">2010</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<monogr>
		<title level="m" type="main">Grouping Multidimensional Data: Recent Advances in Clustering</title>
		<author>
			<persName><forename type="first">P</forename><surname>Berkhin</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2006">2006</date>
			<publisher>Springer</publisher>
			<biblScope unit="page" from="25" to="71" />
			<pubPlace>Berlin Heidelberg</pubPlace>
		</imprint>
	</monogr>
	<note>A Survey of Clustering Data Mining Techniques</note>
</biblStruct>

<biblStruct xml:id="b7">
	<analytic>
		<title level="a" type="main">Pay-As-You-Go Entity Resolution</title>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">E</forename><surname>Whang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Marmaros</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Garcia-Molina</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">IEEE Transactions on Knowledge and Data Engineering</title>
		<imprint>
			<biblScope unit="volume">25</biblScope>
			<biblScope unit="issue">5</biblScope>
			<date type="published" when="2013">2013</date>
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
