A Query-driven and Incremental Process for Entity Resolution

Introduction

Companies and governmental organizations around the world publish a huge volume of data, which can be stored in multiple data sources. To access and analyze these data, strategies for data integration are needed. The aim of data integration is to combine heterogeneous and autonomous data sources to provide a single view to the user [1]. An important component of the data integration process is the Entity Resolution (ER) task [2]. The goal of ER is to identify tuples that refer to the same real-world entity (in this work, tuple is synonymous with instance and record). This problem is known by a variety of names: Record Linkage, Entity Resolution, Object Reference, Reference Linkage, Duplicate Detection, or Deduplication. In this paper, we adopt the term Entity Resolution (ER) [2].

Companies and organizations often have to deal with dynamic data sources holding a large volume of data. In this context, the ER process can be very challenging because most currently available ER techniques process all the entities at once [3]. This occurs because most of these techniques are based on batch algorithms, which resolve all tuples instead of resolving only those related to a single query [4,5,6]. Hence, new techniques are needed to support real-time ER for dynamic and large databases.

For example, consider a set of bibliographic data sources and a query to retrieve all papers by a given author (e.g., "Getoor"). To answer this query, it is not necessary to look at other authors' papers or to perform ER over the whole set of papers. In this case, it is better to focus only on the tuples describing papers by the author specified in the query.

In this paper, we propose a QUery-Driven and Incremental process for Entity Resolution (QuID). The QuID process considers query results on multiple data sources. It is an incremental process, i.e., for each new query result, QuID reuses the clusters produced by previous ER tasks, and the updated clusters remain available to answer future queries. In our approach, ER is considered as a clustering problem [7], in which each cluster corresponds to the tuples of a single real-world entity. During ER, the results of queries are analyzed, and each tuple of the query result is inserted incrementally into a cluster. Our solution maintains an index for the tuples and performs incremental clustering, resulting in clusters of tuples that refer to the same real-world entity.

The rest of the paper is organized as follows. In Section 2 we discuss related work. In Section 3 we formally define the problem and describe the QuID process. In Section 4 we conclude.

Related Work

Bhattacharya and Getoor [4] proposed a strategy tailored to query-time entity resolution that identifies and resolves only those database references that are the most helpful for processing a given query. Altwaijry et al. [5] proposed a query-driven approach to ER that exploits the specificity and semantics of the given SQL query. Neither paper proposes to reuse previous results of the ER process. The solution proposed by Gruenheid et al. [3] uses an incremental clustering algorithm to perform ER. Each inserted tuple is compared with the existing clusters and is either placed into an existing cluster or assigned to a new cluster; extra information from the data updates is used to fix errors in previous clusters. This solution does not consider query results during the ER task. Unlike the approaches mentioned above, the process proposed in this paper is both incremental and query-driven. To the best of our knowledge, there are no other approaches that combine these two features.

Problem Statement

In this section we formally define the problem of query-driven and incremental ER (Section 3.1). We then describe our Query-Driven and Incremental process for Entity Resolution (QuID) (Section 3.2).

Problem Definition

Given a set of tuples, the ER process is essentially a clustering problem, in which each cluster contains tuples that represent a single real-world entity. If we consider the ER problem in multiple data sources, each tuple can be from a different source.

In this paper, our focus is on incremental clustering algorithms. The goal of incremental clustering is to make the ER process faster than processes that do not reuse earlier results. The main goal of using query results is to reduce the volume of tuples to be resolved, which in turn reduces the number of comparisons made between tuples.
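The effect of reducing the volume of tuples can be illustrated with a small calculation. The sketch below assumes a naive strategy that compares every pair of tuples exactly once (no blocking or indexing); the function name and the tuple counts are hypothetical and chosen only for illustration.

```python
def pairwise_comparisons(n_tuples: int) -> int:
    """Number of unordered tuple pairs when every pair is compared once."""
    return n_tuples * (n_tuples - 1) // 2


# Hypothetical sizes: a batch run over 10,000 tuples versus
# a query result that contains only 50 tuples.
batch = pairwise_comparisons(10_000)
query = pairwise_comparisons(50)
print(batch, query)  # 49995000 1225
```

Because the number of pairs grows quadratically, restricting ER to a query result shrinks the comparison workload far more than proportionally to the reduction in tuples.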

Formally, we denote by $S = \{S_1, S_2, \ldots, S_n\}$ a set of data sources and by $Q = \{Q_1, Q_2, \ldots, Q_m\}$ a set of queries running on $S$. Each source has a set of entities $S_i.E$, where $E = \{E_1, E_2, \ldots, E_w\}$. Each entity $E_j$ from $S_i.E$ has a set of tuples $S_i.E_j.T = \{t_1, t_2, \ldots, t_n\}$, where each $t_p$ is an instance of the entity $E_j$. A tuple $t_p$ is defined as follows.

Definition 1. Each tuple $t_p$ belonging to $S_i.E_j.T$ is represented by a set of pairs of attributes ($A_k$) and values ($v_k$), $t_p = \{(S_i.E_j.A_1, v_1), (S_i.E_j.A_2, v_2), \ldots, (S_i.E_j.A_n, v_n)\}$. Each attribute $A_k$ belongs to an entity ($E_j$) of a data source ($S_i$), denoted by $S_i.E_j.A_k$. Each tuple $t_p$ has a pair $(S_i.E_j.A_{Id}, v_{Id})$, which represents the unique identifier of the tuple (Id).
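One possible in-memory representation of Definition 1 is sketched below. The class name `ERTuple`, its field names, and the sample source `dblp` are assumptions made for illustration; they are not part of the QuID proposal.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ERTuple:
    """A tuple t_p of entity E_j in data source S_i (hypothetical layout)."""
    source: str    # S_i
    entity: str    # E_j
    tuple_id: str  # value of the identifier pair (Id)
    pairs: tuple   # ((attribute name, value), ...)

    def qualified_pairs(self) -> dict:
        """Map each attribute, qualified as S_i.E_j.A_k, to its value v_k."""
        prefix = f"{self.source}.{self.entity}"
        return {f"{prefix}.{attr}": value for attr, value in self.pairs}


# Illustrative tuple from a hypothetical bibliographic source.
t = ERTuple(
    source="dblp",
    entity="Paper",
    tuple_id="p1",
    pairs=(("title", "Query-time Entity Resolution"), ("author", "Getoor")),
)
print(t.qualified_pairs()["dblp.Paper.author"])  # Getoor
```

Keeping the source and entity on every tuple makes it possible to cluster tuples that come from different sources, as the problem definition requires.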

A query $Q_i$ may not contain all the attributes that are necessary (relevant) to decide whether two tuples represent the same real-world entity. Thus, the query undergoes an expansion process that collects the relevant attributes [8] that were not specified in the initial query. This expansion generates a query $Q_i'$. The input of the QuID process is the result of the query $Q_i'$, defined as follows.

Definition 2. A query result, $Q_i'.R$, is represented by a set of tuples (Definition 1) that belong to an entity $E_j$. The attributes that describe the tuples of the result $Q_i'.R$ include the set of relevant attributes ($A_r$), $S_i.E_j.A_r$, where $S_i.E_j.A_r \subseteq S_i.E_j.A$.
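The expansion of $Q_i$ into $Q_i'$ can be pictured as extending the attribute list of the query with the relevant attributes it lacks. The sketch below assumes that the relevant attributes $A_r$ are already known (how they are collected is outside its scope); the function `expand_query` and the attribute names are hypothetical.

```python
def expand_query(requested, relevant, available):
    """Return the attribute list of Q_i': requested attributes plus any
    relevant attributes that the original query did not ask for."""
    unknown = [a for a in relevant if a not in available]
    if unknown:
        # A_r must be a subset of the attributes of the entity.
        raise ValueError(f"relevant attributes not in entity: {unknown}")
    expanded = list(requested)
    for attr in relevant:
        if attr not in expanded:
            expanded.append(attr)
    return expanded


entity_attrs = ["id", "title", "author", "year", "venue"]
q_attrs = ["title"]                          # attributes of Q_i
relevant_attrs = ["title", "author", "year"]  # assumed A_r
print(expand_query(q_attrs, relevant_attrs, entity_attrs))
# ['title', 'author', 'year']
```

The check on `available` mirrors the subset condition of Definition 2: every relevant attribute must be an attribute of the entity.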

For each newly received query result, the ER process reuses the results of previous ER tasks, i.e., previously generated clusters, to answer the query.
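The reuse of previously generated clusters can be sketched as a lookup step. The code below assumes that earlier results are stored as (cluster identifier, entity, tuple identifier) triples and that the clusters for the current query are kept as a mapping from cluster identifier to members; both layouts, and all names, are illustrative assumptions rather than the structures used by QuID.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    cluster_id: int
    entity: str    # source and entity, e.g. "dblp.Paper"
    tuple_id: str  # tuple identifier within that entity


def reuse_clusters(stored, result_keys):
    """Split a query result into already clustered tuples and pending ones.

    stored: iterable of Triple from earlier ER tasks.
    result_keys: list of (entity, tuple_id) pairs in the query result.
    Returns (clusters for this result, keys still to be processed).
    """
    known = {(t.entity, t.tuple_id): t.cluster_id for t in stored}
    clusters, pending = {}, []
    for key in result_keys:
        if key in known:
            clusters.setdefault(known[key], set()).add(key)
        else:
            pending.append(key)
    return clusters, pending


stored = [
    Triple(1, "dblp.Paper", "p1"),
    Triple(1, "acm.Paper", "p7"),
    Triple(2, "dblp.Paper", "p2"),
]
result = [("dblp.Paper", "p1"), ("acm.Paper", "p7"), ("acm.Paper", "p9")]
clusters, pending = reuse_clusters(stored, result)
print(clusters)  # one cluster (id 1) holding the two known tuples
print(pending)   # [('acm.Paper', 'p9')]
```

Only the tuples returned in `pending` need to be compared and clustered, which is where the savings of the incremental approach come from.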

QuID

In this section, we describe the proposed process (QuID). Fig. 1 shows the flow of information in QuID. The input of the process is a query result ($Q_i'.R'$). The process starts with the Indexing step, which aims to reduce the number of comparisons between pairs of tuples. During this step, two indexes are used: the Similarity Index and the Cluster Index. The first incrementally maintains the similarity values between pairs of tuples. The second incrementally maintains a set of clusters of tuple identifiers. After the Indexing step, the local cluster ($L_c$) is initialized from the global cluster ($G_c$), reusing the results of previous ER tasks. After the initialization of $L_c$, the tuples that were not processed previously are handled in the Tuple Pair Comparison step. In this step, the similarity value for each pair of tuples is either retrieved from the Similarity Index or, if it is not yet stored, newly calculated.
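The retrieve-or-compute behaviour of the Tuple Pair Comparison step can be sketched as a cache keyed by unordered pairs of tuple identifiers. The class below is a minimal illustration, not the index implemented for QuID; the token-based Jaccard measure is a stand-in chosen for simplicity, and all names are hypothetical.

```python
def jaccard(a: str, b: str) -> float:
    """Token-level Jaccard similarity (stand-in similarity function)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


class SimilarityIndex:
    """Stores one similarity value per unordered pair of tuple ids."""

    def __init__(self, sim_fn):
        self._sim_fn = sim_fn
        self._values = {}
        self.computed = 0  # number of similarity computations performed

    def get(self, id_a, value_a, id_b, value_b):
        key = frozenset((id_a, id_b))  # order of the pair does not matter
        if key not in self._values:
            self._values[key] = self._sim_fn(value_a, value_b)
            self.computed += 1
        return self._values[key]


index = SimilarityIndex(jaccard)
first = index.get("p1", "lise getoor", "p7", "l getoor")
again = index.get("p7", "l getoor", "p1", "lise getoor")  # served from the index
print(first == again, index.computed)  # True 1
```

The second call finds the stored value even though the pair is given in the opposite order, so a pair seen in an earlier query result is never compared twice.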

After the Tuple Pair Comparison step, the next step is Incremental Clustering. The input of this step is a similarity graph, in which nodes are tuples and edges carry the similarity values between tuples. The goal of Incremental Clustering is to insert the tuples that were not processed before into the local cluster ($L_c$) and the global cluster ($G_c$). Finally, after Incremental Clustering, the output of QuID is the updated $L_c$ and $G_c$, ready for reuse in subsequent ER tasks.
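As an illustration of inserting one unprocessed tuple, the sketch below uses a simple greedy rule: the tuple joins the existing cluster with the highest average similarity if that average reaches a threshold, and otherwise starts a new cluster. This rule, the threshold value, and the data layout are assumptions for the example; the clustering algorithm for QuID is still under investigation and may differ.

```python
def insert_tuple(clusters, tuple_id, edges, threshold=0.5):
    """Insert tuple_id into clusters (dict: cluster id -> non-empty set of ids).

    edges: similarity graph as dict frozenset({id_a, id_b}) -> similarity.
    Returns the id of the cluster that received the tuple.
    """
    best_id, best_score = None, threshold
    for cid, members in clusters.items():
        total = sum(edges.get(frozenset((tuple_id, m)), 0.0) for m in members)
        score = total / len(members)  # average similarity to the cluster
        if score >= best_score:
            best_id, best_score = cid, score
    if best_id is None:
        best_id = max(clusters, default=0) + 1  # open a new cluster
        clusters[best_id] = set()
    clusters[best_id].add(tuple_id)
    return best_id


edges = {
    frozenset(("a", "b")): 0.9,
    frozenset(("a", "c")): 0.1,
    frozenset(("b", "c")): 0.2,
}
clusters = {1: {"a"}}
insert_tuple(clusters, "b", edges)  # joins cluster 1 (similarity 0.9)
insert_tuple(clusters, "c", edges)  # average 0.15 < 0.5, new cluster 2
print(clusters)
```

In a full process the same insertion would be applied to both the local and the global cluster so that later queries can reuse the result.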

Conclusions

In this short paper, we introduced and motivated an incremental and query-driven Entity Resolution process, denoted QuID. We also presented the main components of QuID and some important definitions related to our proposal. So far, we have implemented the two proposed indexes (the Cluster Index and the Similarity Index). We are currently investigating and evaluating the impact of the incremental clustering algorithm [3,4] in the context of the proposed process. As future work, we will instantiate and evaluate the complete process.

Fig. 1. Proposed process (QuID)

Our approach uses two types of clusters: global clusters and local clusters. Global Clusters ($G_c$) are created only once and updated incrementally at each query result $Q_i'.R'$. A $G_c$ supports the query-driven process by allowing previous results to be reused in future queries. A global cluster is defined as follows.

Definition 3. A Global Cluster ($G_c$) is defined by a set of triples, $G_c = \{\langle ClusterId,\ S_i.E_j,\ S_i.E_j.t_p.Id \rangle\}$, where $ClusterId$ is an identifier of the cluster, $S_i.E_j$ is the entity and the data source of the tuple $t_p$, and $S_i.E_j.t_p.Id$ is the tuple identifier.

Local Clusters ($L_c$) are created for each query result $Q_i'.R'$. The output of the ER process is the $L_c$ containing the duplicated tuples detected in the query result. $L_c$ uses previously classified information from the global cluster $G_c$. We define local cluster as follows.

References

[1] Lenzerini, M.: Ontology-based Data Management. In: International Conference on Information and Knowledge Management (CIKM'11), New York, NY, USA (2011)
[2] Christen, P.: Data Matching: Concepts and Techniques for Record Linkage, Entity Resolution, and Duplicate Detection. Springer (2012)
[3] Gruenheid, A., Dong, X.L., Srivastava, D.: Incremental Record Linkage. In: VLDB'2014, Hangzhou, China (2014)
[4] Bhattacharya, I., Getoor, L.: Query-time Entity Resolution. Journal of Artificial Intelligence Research (2007)
[5] Altwaijry, H., Kalashnikov, D.D., Mehrotra, S.: Query-Driven Approach to Entity Resolution. In: VLDB 2013, Italy (2013)
[6] Su, W., Wang, J., Lochovsky, F.H.: Record Matching Over Query Results from Multiple Web Databases. IEEE Transactions on Knowledge and Data Engineering 22(4) (2010)
[7] Berkhin, P.: A Survey of Clustering Data Mining Techniques. In: Grouping Multidimensional Data: Recent Advances in Clustering. Springer Berlin Heidelberg (2006)
[8] Whang, S.E., Marmaros, D., Garcia-Molina, H.: Pay-As-You-Go Entity Resolution. IEEE Transactions on Knowledge and Data Engineering 25(5) (2013)