<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>April</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>On Customer Data Deduplication: Lessons Learned from a R&amp;D Project in the Financial Sector</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Paweł Boiński</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Mariusz Sienkiewicz</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Bartosz Bębel</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Robert Wrembel</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Dariusz Gałęzowski</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Waldemar Graniszewski</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>DAMA Poland</institution>
          ,
          <addr-line>Warsaw</addr-line>
          ,
          <country country="PL">Poland</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Poznan University of Technology</institution>
          ,
          <addr-line>Poznań</addr-line>
          ,
          <country country="PL">Poland</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Warsaw University of Technology</institution>
          ,
          <addr-line>Warsaw</addr-line>
          ,
          <country country="PL">Poland</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2022</year>
      </pub-date>
      <volume>1</volume>
      <issue>2022</issue>
      <fpage>0000</fpage>
      <lpage>0001</lpage>
      <abstract>
        <p>Despite the fact that financial institutions (FIs) apply data governance strategies and use the most advanced state-of-the-art data management and data engineering software and systems to support their day-to-day businesses, their databases are not free from faulty (dirty and duplicated) data. In this paper, we report some conclusions from an ongoing research and development project for a FI. The goal of this project is to integrate customers' data from multiple data sources and to clean, homogenize, and deduplicate them. This paper, in particular, focuses on findings from developing the customers' data deduplication process.</p>
      </abstract>
      <kwd-group>
        <kwd>data quality</kwd>
        <kwd>data cleaning</kwd>
        <kwd>data deduplication pipeline</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Financial institutions (FIs) apply data governance strategies and use the most advanced state-of-the-art data management and data engineering software to manage data collected by their day-to-day businesses. Unfortunately, the application of advanced technologies does not prevent them from collecting and storing some faulty data - mainly erroneous, outdated, and duplicated, e.g., [1]. Such data mainly concern customers, both individuals and institutions.</p>
      <p>Duplicated and outdated data cause economic losses, increase customer dissatisfaction, and damage the reputation of a FI. For these reasons, the integration, cleaning, and deduplication of customers' data is one of the key processes in data governance.</p>
      <p>In the research literature, a base-line data
deduplication pipeline has been proposed, e.g., [2, 3, 4]. It has
become a standard pipeline for multiple data
deduplication projects. The pipeline includes four basic tasks,
namely: (1) blocking (a.k.a. indexing), which arranges
records into groups, such that each group is likely to
include duplicates, (2) block processing (a.k.a. filtering),
which eliminates records that do not have to be
compared, (3) entity matching (a.k.a. similarity computation),
which computes similarity values between record pairs,
and (4) entity clustering, which creates larger clusters of
similar records.</p>
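      <p>For illustration, a minimal Python sketch of how these four tasks compose is shown below; the record fields, the blocking key, and the threshold are placeholders, and the similarity function is left abstract.</p>
      <preformat>
from itertools import combinations

def blocking(records):
    """(1) Arrange records into groups that are likely to contain duplicates."""
    blocks = {}
    for rec in records:
        key = rec["last_name"][:3].lower()  # toy blocking key
        blocks.setdefault(key, []).append(rec)
    return blocks

def block_processing(blocks):
    """(2) Drop blocks whose records do not have to be compared (e.g., singletons)."""
    return {key: recs for key, recs in blocks.items() if len(recs) > 1}

def entity_matching(blocks, similarity, threshold=0.8):
    """(3) Compute similarity for record pairs inside each block; keep likely matches."""
    matches = []
    for recs in blocks.values():
        for a, b in combinations(recs, 2):
            if similarity(a, b) >= threshold:
                matches.append((a["id"], b["id"]))
    return matches

def entity_clustering(matches):
    """(4) Group matched pairs into clusters of similar records (connected components)."""
    parent = {}
    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    for a, b in matches:
        parent[find(a)] = find(b)
    clusters = {}
    for x in list(parent):
        clusters.setdefault(find(x), set()).add(x)
    return list(clusters.values())
      </preformat>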
      <sec id="sec-1-1">
        <title>In this paper, we outline our experience and findings</title>
        <p>from designing a deduplication pipeline for customers’
data (Section 2). We discuss approaches that are
possible for each task in the pipeline and present particular
solutions that were proven to be adequate to solve the
addressed problem. Final conclusions are presented in
Section 3. Notice that this paper presents findings from
a real R&amp;D project, and therefore, not all details can be
revealed, as they are treated as the company know-how.</p>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>2. Deduplication pipeline in the project</title>
      <p>Inspired by the aforementioned base-line data deduplication pipeline (BLDDP), in the described project we apply an adjusted pipeline that suits the goals of our project. There are three basic differences between our pipeline and the BLDDP. First, in our pipeline, we explicitly included all steps that we found to be crucial for the deduplication process on customers' data, whereas in the BLDDP some steps are implicit. Second, the last task in our pipeline allows further merging of some groups of similar records (cliques), whereas the BLDDP, to the best of our knowledge, does not include such a task. Third, our pipeline accepts dirty customers' data, whereas the BLDDP assumes that input data were cleaned beforehand.</p>
      <p>Our pipeline includes the following tasks (cf. Figure 1), which are outlined in the remainder of the paper: [T1] selecting grouping attributes, [T2] selecting attributes used to compare record pairs, [T3] choosing a method for comparing records, [T4] selecting similarity measures for comparing values of attribute pairs, [T5] defining weights of attributes to compute records' similarities and choosing similarity thresholds, [T6] building pairs of records having a high similarity value, [T7] building cliques of similar records, and [T8] further merging cliques of similar records. [T1] realizes blocking in the BLDDP; [T3] realizes block processing and entity matching; [T2], [T4], and [T5] realize entity matching; [T6] and [T7] realize entity clustering.</p>
      <sec id="sec-2-0">
        <title>2.1. Pipeline implementation environments</title>
        <p>Tasks [T1] to [T6] were implemented in parallel in two alternative environments. The first one is a typical data engineering environment, based on the Oracle DBMS as data storage and the PL/SQL programming language for implementing the deduplication pipeline; this environment is the standard one in the FI running the project. The second environment is a typical data science environment, based on csv files as data storage and Python (Anaconda or Jupyter-lab, data science packages) for implementing the pipeline.</p>
      </sec>
      <sec id="sec-2-tasks">
        <title>2.2. Tasks in the pipeline</title>
        <sec id="sec-2-tasks-1">
          <title>2.2.1. T1: Grouping attributes</title>
          <p>The main challenge in grouping records is to select such a set of grouping attributes that would allow identifying the highest number of potentially duplicate records. Even though 14 different blocking methods have been proposed in the research literature [5], there is no single universal grouping method suitable for all application domains [6].</p>
          <p>Inspired by [7], we proposed a method based on statistical characteristics of customers' attributes: (1) the number of … to the number of … values of an attribute, (2) the number of … values to the number of … values of an attribute, (3) the number of … to the number of … values of an attribute, and (4) the number of (… − …)/… values of an attribute. These characteristics are computed for every attribute being a candidate for grouping. Additionally, (1) the diversity of values of each attribute is modeled by means of the Gini index and (2) the size of a record group is penalized by means of a quadratic function with a negative value of coefficient a.</p>
          <p>Notice that the initial set of potential grouping attributes was selected based on expert knowledge. It included 20 attributes. Next, the candidate attributes were processed by means of our method and their statistical characteristics were computed. The obtained ranking of attributes was verified by domain experts. Based on their input, the final set of attributes was selected for arranging (grouping) records.</p>
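          <p>For illustration only, since the exact characteristics and weighting are only summarized above, the following sketch computes simple per-attribute statistics over a pandas DataFrame, including a Gini-index diversity score and a quadratic group-size penalty, and ranks the candidates; the column names and the scoring formula are assumptions, not the project's actual configuration.</p>
          <preformat>
import pandas as pd

def gini_index(series: pd.Series) -> float:
    """Gini impurity of one attribute's value distribution (1 minus sum of squared shares)."""
    shares = series.value_counts(normalize=True, dropna=True)
    return float(1.0 - (shares ** 2).sum())

def attribute_statistics(df: pd.DataFrame, candidates: list) -> pd.DataFrame:
    """Per-attribute statistics used to rank candidate grouping attributes."""
    rows = []
    for col in candidates:
        n = len(df)
        n_null = int(df[col].isna().sum())
        n_distinct = int(df[col].nunique(dropna=True))
        rows.append({
            "attribute": col,
            "null_ratio": n_null / n,
            "distinct_ratio": n_distinct / n,
            "avg_group_size": (n - n_null) / max(n_distinct, 1),
            "gini_diversity": gini_index(df[col]),
        })
    return pd.DataFrame(rows)

def rank_grouping_attributes(stats: pd.DataFrame, penalty_a: float = -0.001) -> pd.DataFrame:
    """Toy score: prefer diverse, well-filled attributes; penalize large groups quadratically."""
    score = (
        stats["gini_diversity"]
        - stats["null_ratio"]
        + penalty_a * stats["avg_group_size"] ** 2
    )
    return stats.assign(score=score).sort_values("score", ascending=False)
          </preformat>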
        </sec>
        <sec id="sec-2-tasks-2">
          <title>2.2.2. T2: Attributes for comparing record pairs</title>
          <p>Having ordered records by the attributes from the ranking obtained from task [T1], the next task is to select attributes whose values will be compared in record pairs, to compute the similarity of records in each pair. Potential candidates for comparison are attributes that: (1) are record identifiers, (2) do not include nulls, (3) include cleaned values, e.g., no typos and no additional erroneous characters, and (4) include unified (homogenized) values, e.g., no abbreviations and the same acronyms used throughout the whole data set.</p>
          <p>Unfortunately, in real cases, such attributes often do not exist. As concerns record identifiers, in FI applications, natural identifiers are typically used, but their values are frequently artificially generated (in cases when natural identifiers cannot be used, i.e., a customer is not able to provide them). Thus, in some cases, artificially generated IDs may have the same values as the natural ones. Notice that the financial sector is strictly regulated by means of European law, national law, and recommendations issued by institutions controlling the sector. As a consequence, procedures aiming at improving the quality of data in this sector are strictly controlled. For this reason, the possibilities of applying data cleaning processes are limited.</p>
          <p>In the described project, the set of attributes selected for comparing record pairs is based on the aforementioned preferable attribute characteristics and on expert knowledge. The set includes 18 attributes describing individual customers (e.g., personal data and address components) and 24 attributes describing institutional customers (e.g., institution names, addresses, type of business run).</p>
        </sec>
        <sec id="sec-2-tasks-3">
          <title>2.2.3. T3: A method for comparing records</title>
          <p>Based on the ranking of grouping attributes obtained from task [T1] (cf. Section 2.2.1), records need to be arranged into groups. Next, in each group records are compared in pairs. The literature proposes two popular techniques for grouping, namely: hashing, e.g., [8, 9], or sorting, known as the sorted neighborhood method, e.g., [10, 11].</p>
          <p>The sorted neighborhood method accepts one parameter, that is, the size of a sliding window in which records are compared. The larger the window size is, the more potential duplicates can be found, but the longer the record comparison time is, since more records have to be compared each time. Experimental evaluations of the window size in the literature discuss a typical size that ranges from 2 to 60 records [10].</p>
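          <p>A minimal sketch of the sorted neighborhood idea follows; it is independent of the particular Python library used in the project, and the sorting key and window size are illustrative.</p>
          <preformat>
def sorted_neighborhood_pairs(records, sort_key, window_size=20):
    """Generate candidate record pairs with the sorted neighborhood method.

    records     -- list of record dicts
    sort_key    -- function building the sorting key from a record
    window_size -- number of consecutive records compared with each other
    """
    ordered = sorted(records, key=sort_key)
    n = len(ordered)
    for i in range(n):
        # compare record i with the next (window_size - 1) records in sort order
        for j in range(i + 1, min(i + window_size, n)):
            yield ordered[i], ordered[j]

# Illustrative usage (field names are assumptions, not the project's real schema):
# pairs = list(sorted_neighborhood_pairs(customers,
#                                        sort_key=lambda r: (r["last_name"], r["city"]),
#                                        window_size=20))
          </preformat>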
          <p>The sorted neighborhood method is intuitive, has an acceptable computational complexity, and is available in one of the Python libraries. For this reason, it was applied in the project. We ran a series of experiments in order to determine the best window size. Our experiments showed that the size has to be adjusted experimentally for a particular data set being deduplicated. For comparing individual customers' records we used a window of 20 records, whereas for comparing institutional customers we used a variable window size with a maximum of 200 records.</p>
        </sec>
      </sec>
      <sec id="sec-2-1">
        <title>2.2.4. T4: Similarity measures for text attributes</title>
        <sec id="sec-2-1-1">
          <title>The literature on data deduplication and similarity mea</title>
          <p>sures lists well over 30 diferent similarity measures for
text data, e.g., [12, 13]. One may find in the literature
suggestions, supported by experimental evaluations, on
the applicability of diferent measures to diferent text
data, e.g., [14, 15, 16, 12].</p>
          <p>In our project, we evaluated 44 measures available in
Python packages. Some measures, e.g., Levenshtein, Jaro,
Jaro-Winkler exist in a few diferent implementations
(packages), thus we evaluated these implementations as
well. The evaluation was run on three diferent real data
sets, i.e., (1) customers’ last names of average length of
10.9 characters, (2) street names of avg length of 16 chars,
and (3) institution names of avg length of 45.5 chars.
Customers’ names included 98% of 1-word names and 2%
of 2-word names, which reflected a real distribution of
such types of names in our customers’ population. All
test data represented true positives, but with typical real
errors found by data profiling.</p>
        <p>From the evaluation we drew the following conclusions:
• for short strings, like last names and street names (composed of 7 to 28 characters), the Overlap, Jaro-Winkler, and StrCmp95 similarity measures gave the highest similarity values;
• for long strings, like institution names (composed of 46 to 116 characters and up to 12 separate words), the Overlap, Sorensen, and StrCmp95 measures gave the highest similarity values; therefore, they were recommended for comparing such kinds of data.</p>
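        <p>For illustration, the following sketch shows how such a comparison can be scripted; it uses the textdistance package as one possible source of these measures, and the example strings are toy values rather than the project's data.</p>
        <preformat>
import textdistance  # one of several Python packages providing these measures

# Measures discussed above; each exposes normalized_similarity(s1, s2) in [0, 1].
MEASURES = {
    "levenshtein": textdistance.levenshtein,
    "jaro_winkler": textdistance.jaro_winkler,
    "strcmp95": textdistance.strcmp95,
    "overlap": textdistance.overlap,
    "sorensen": textdistance.sorensen,
}

def compare(s1: str, s2: str) -> dict:
    """Return the normalized similarity of s1 and s2 under each measure."""
    return {name: round(m.normalized_similarity(s1, s2), 3)
            for name, m in MEASURES.items()}

# Toy examples with typical real-world errors (typos, reordered tokens):
print(compare("Kowalski", "Kowaslki"))
print(compare("Poznan University of Technology", "University of Technology Poznan"))
        </preformat>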
        <sec id="sec-2-1-2">
          <title>On top of the rules, for institutional customers we</title>
          <p>used equal weights for each attribute being the subject of
comparison. Whereas for individual customers, higher
weights were set for ID and last name attributes. These
weights were set based on an iterative experimentation
process and evaluation of the results by domain experts.</p>
          <p>Based on the weighted values of similarities between
pairs of individual attributes of records  and , a
total similarity of (, ) was computed. Let us denote
it as . Based on its value, a given pair of records
was classified either as similar (matches) or non-similar
(non-matches), or undecided. For this kind of
classification, the so-called similarity thresholds had to be defined.</p>
          <p>Again, in practice, these thresholds are defined based on
the analysis of the obtained record pairs and based on
knowledge of domain experts [12, 17].</p>
          <p>In our project we applied the same approach. Based on
the knowledge of the FI experts, the lowest value of 
(i.e., for similar records) was set to 0.8 and the highest
value for non-similar was set to 0.6.
• for short strings, like last names and street names
(composed of 7 to 28 characters), the Overlap,
Jaro-Winkler, and StrCmp95 similarity measures
gave the highest similarity values; 2.2.6. T6: Building pairs of similar records
• for long strings, like institution names (composed The sorted neighborhood method produces pairs of similar
of 46 to 116 characters and up to 12 separate
words) the Overlap, Sorensen, and StrCmp95 mea- irleacroitrydsvawluitehs: f(o1)r tehaecihr
oatvterribalultveablueeinogfcompaarnedd.(2T)hseimsesures gave the highest similarity values, there- data are stored in a repository, cf. Section 2.1 and
visualfore they were recommended for comparing such ized in a spreadsheet for expert verification.
kinds of data.</p>
        </sec>
      </sec>
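        <p>A minimal sketch of this step follows; the attribute names, weights, and the handling of non-comparable attributes are simplified illustrations rather than the project's actual configuration, and only the 0.6/0.8 thresholds come from the text above.</p>
        <preformat>
def record_similarity(attr_sims: dict, weights: dict) -> float:
    """Weighted average of per-attribute similarities.

    attr_sims -- {attribute: similarity in [0, 1], or None when the rules say the
                  attribute must not be compared (null value, artificial vs. real ID)}
    weights   -- {attribute: weight}
    """
    usable = {a: s for a, s in attr_sims.items() if s is not None}
    total_weight = sum(weights[a] for a in usable)
    if total_weight == 0:
        return 0.0
    return sum(weights[a] * s for a, s in usable.items()) / total_weight

def classify(similarity: float, t_match: float = 0.8, t_non_match: float = 0.6) -> str:
    """Classify a record pair using the similarity thresholds mentioned above."""
    if similarity >= t_match:
        return "match"
    if similarity > t_non_match:
        return "undecided"
    return "non-match"

# Illustrative weights for individual customers (ID and last name weighted higher).
weights = {"id": 3.0, "last_name": 2.0, "first_name": 1.0, "street": 1.0}
sims = {"id": None, "last_name": 0.95, "first_name": 1.0, "street": 0.7}  # ID not comparable
print(classify(record_similarity(sims, weights)))  # "match"
        </preformat>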
      <sec id="sec-2-2">
        <title>2.2.5. T5: Attribute weights and similarity thresholds</title>
        <p>• when  has a defined value of attribute  and
the value of  of  is null, then the values of</p>
        <sec id="sec-2-2-1">
          <title>Since similar records’ pairs may form larger sets, to find</title>
          <p>In task [T5], we applied an iterative process of tuning such sets, all similar pairs have to be combined in a graph,
weights of attributes used to compute records’ similarity, with records representing nodes and labeled edges
reprewith the support of domain experts. Additionally, rules senting similarities between records. In such a graph, a
had to be defined to decide whether to compare values group of similar records forms a maximal clique. Thus,
of a given attribute. Let us assume that a pair of records the problem of finding sets of similar records transforms
 and  is compared to compute their similarity value. to finding maximal cliques in a graph. In general, it is
Some cases handled by the rules include: a NP-hard problem [18]. This problem becomes
computationally less expensive for sparse graphs, e.g., [19, 20],
which is the case of a graph created from similar records.</p>
        </sec>
      </sec>
      <sec id="sec-2-3">
        <title>2.2.7. T7: Building cliques of similar records</title>
        <sec id="sec-2-3-1">
          <title>For finding maximal cliques a few fast algorithms were developed.</title>
          <p>One of them is the Bron-Kerbosh algorithm [21], which
we decided to use for the following reasons. First, it is
frequently used in the community working on graph
processing. Second, it is implemented in multiple
programming languages, including Python. Third, its worst
case computational complexity is (3/3), where 
denotes the number of graph nodes. For sparse graphs the
complexity is lower.</p>
          <p>The algorithm was used for finding cliques in a graph
composed of 2228580 customers’ nodes. This evaluation
confirmed its eficiency in terms of processing time and
its applicability to the deduplication problem (confirmed
by the experts from the FI).</p>
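        <p>A sketch of this step using networkx, whose find_cliques function implements a variant of the Bron-Kerbosch algorithm, is shown below; the pair list and the similarity threshold are illustrative only.</p>
        <preformat>
import networkx as nx

def similar_record_cliques(similar_pairs, min_similarity=0.8):
    """Build a graph from similar record pairs and return its maximal cliques.

    similar_pairs -- iterable of (record_id_a, record_id_b, similarity)
    """
    graph = nx.Graph()
    for a, b, sim in similar_pairs:
        if sim >= min_similarity:
            graph.add_edge(a, b, similarity=sim)
    # nx.find_cliques enumerates maximal cliques (Bron-Kerbosch variant).
    return [set(clique) for clique in nx.find_cliques(graph)]

# Toy example: records 1-3 are mutually similar; record 4 is similar only to record 3.
pairs = [(1, 2, 0.93), (1, 3, 0.88), (2, 3, 0.91), (3, 4, 0.85)]
print(similar_record_cliques(pairs))  # e.g., [{1, 2, 3}, {3, 4}]
        </preformat>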
      </sec>
      <sec id="sec-2-4">
        <title>2.2.8. T8: Merging cliques of similar records</title>
        <p>If the number of similar records is larger than the size of the sliding window in the sorted neighborhood method, then a few cliques are created and all of them contain records that are similar to each other. Therefore, the final step is to merge cliques that include a certain number of common records (currently the Jaccard coefficient is used to decide which cliques to merge). We are also experimenting with a variable, automatically adjustable window size.</p>
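        <p>A minimal sketch of such merging follows; the Jaccard threshold of 0.5 is an assumption for illustration, not the value used in the project.</p>
        <preformat>
def jaccard(a: set, b: set) -> float:
    """Jaccard coefficient of two cliques given as sets of record identifiers."""
    if not a and not b:
        return 0.0
    return len(a.intersection(b)) / len(a.union(b))

def merge_cliques(cliques, min_jaccard=0.5):
    """Repeatedly merge cliques whose overlap (Jaccard coefficient) is high enough."""
    merged = [set(c) for c in cliques]
    changed = True
    while changed:
        changed = False
        for i in range(len(merged)):
            for j in range(i + 1, len(merged)):
                if jaccard(merged[i], merged[j]) >= min_jaccard:
                    merged[i] = merged[i].union(merged[j])
                    del merged[j]
                    changed = True
                    break
            if changed:
                break
    return merged

# Two cliques produced by a too-small sliding window that share most of their records:
print(merge_cliques([{1, 2, 3, 4}, {2, 3, 4, 5}, {7, 8}]))  # [{1, 2, 3, 4, 5}, {7, 8}]
        </preformat>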
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Final observations</title>
      <p>In this paper, we reported our experience from an R&amp;D project for a FI on deduplicating customers' data. The project is ongoing (two out of four stages have already been realized). In the project, we adapted the standard deduplication pipeline from the literature to the particular characteristics of the data being deduplicated and to the project requirements. The whole pipeline was implemented and verified by domain experts. The results obtained so far were accepted by the FI.</p>
      <p>It must be stressed that the reality of the discussed project differs from the one assumed in the research literature, i.e., (1) the assumption on the cleanness of data being deduplicated, (2) the sizes of deduplicated data sets, (3) the availability of tagged data for ML algorithms, and (4) the neglect of the data aging process. None of these assumptions holds in our project, as outlined in the following sections.</p>
      <sec id="sec-3-1">
        <title>3.1. Data cleanness</title>
        <p>The base-line data deduplication pipeline [22, 2, 3, 23, 4] assumes that data delivered to the pipeline are clean (e.g., no null values, no spelling errors, homogenized full names and abbreviations). Unfortunately, this assumption cannot be guaranteed in real projects, especially in the financial sector. There exist typos, missing values, and inconsistent values in attributes storing personal data, institution names, and addresses. Moreover, not all natural IDs are reliable. By regulations, even known dirty customers' data cannot be cleaned without the explicit permission of a customer. Getting such permissions from millions of customers in a finite time frame is impossible. For this reason, only simple cleaning is possible, like removing leading or trailing erroneous characters from customers' addresses. As a consequence, in practice the deduplication pipeline has to be applied to data that have undergone only basic cleaning.</p>
      </sec>
      <sec id="sec-3-1">
        <title>Most of the methods used in the base-line data deduplica</title>
        <p>tion pipeline were verified on either small real data sets,
e.g., bibliographical with 32000 records [6, 24, 25, 17, 26],
restaurants - 500 records [24, 25], movies - 5000 records
[26], or patients - 128000 records [27], or on data sets
generated artificially [6, 7].</p>
        <p>In contrast, in this paper we reported our experience on deduplicating customers' data of much larger volumes, i.e.: (1) 2,228,580 records describing individual customers and (2) 1,185,290 records describing institutional customers. The final goal of the reported project is to apply the developed pipeline and techniques to a database storing more than 11 million customers' records (since the project is ongoing, this stage will be run at the end of the project).</p>
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Tagged data for ML</title>
        <p>Some tasks in the deduplication pipeline can be run with the support of machine learning (ML) techniques, e.g., blocking [28, 29], selecting similarity measures and thresholds [30], and matching similar records [31, 32]. If the pipeline applies ML for the entity matching task, it is assumed that there exists a set of training records tagged as true positives and true negatives. Unfortunately, in a large FI it is impossible to create such a set of training data because of the volume of data to be processed by the pipeline. For an original data set composed of several million customers, a training data set of a reasonable size should include at least several thousand tagged training records. In reality, such a large number of training records cannot be created by experts. For this reason, in practice, training data are frequently unavailable for ML algorithms.</p>
        <p>In order to overcome this difficulty, unsupervised learning techniques are used, e.g., [33, 34]. Some publications report on applying active learning techniques to a deduplication process, e.g., [35, 36, 37, 38, 39], and this direction will also be investigated in the project. Currently we are experimenting with weakly supervised learning [40] with the support of the snorkel library.</p>
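        <p>As an illustration of this direction, weak labels for candidate record pairs can be produced with snorkel roughly as sketched below; the labeling functions, field names, and thresholds are invented for the example and are not the project's actual rules.</p>
        <preformat>
import pandas as pd
from snorkel.labeling import labeling_function, PandasLFApplier
from snorkel.labeling.model import LabelModel

ABSTAIN, NON_DUPLICATE, DUPLICATE = -1, 0, 1

@labeling_function()
def lf_equal_natural_id(pair):
    # A real, equal natural identifier is strong evidence for a duplicate.
    return DUPLICATE if pair.id_left == pair.id_right else ABSTAIN

@labeling_function()
def lf_dissimilar_last_name(pair):
    # Very dissimilar last names suggest a non-duplicate.
    return ABSTAIN if pair.last_name_sim > 0.3 else NON_DUPLICATE

# One row per candidate pair, with precomputed attribute similarities (toy data).
candidate_pairs = pd.DataFrame({
    "id_left": ["A1", "A2"], "id_right": ["A1", "B7"],
    "last_name_sim": [0.95, 0.20],
})

lfs = [lf_equal_natural_id, lf_dissimilar_last_name]
label_matrix = PandasLFApplier(lfs).apply(candidate_pairs)

label_model = LabelModel(cardinality=2, verbose=False)
label_model.fit(label_matrix, n_epochs=200, seed=42)
weak_labels = label_model.predict(label_matrix)  # training labels for a matcher
        </preformat>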
      </sec>
      <sec id="sec-3-4">
        <title>3.4. Data aging</title>
        <p>An inherent feature of some types of data is their aging. For example, customers' last names, identification documents, different types of postal addresses, and contact data (phone numbers, e-mails) have this feature. Outdated data impact the possibility to discover duplicate records. For this reason, for a deduplication process it would be profitable to know which pieces of the compared data are likely to be outdated.</p>
        <p>Building data aging models has not been researched so far (the only approach addressing a related problem is [41], but in the context of temporal data). Including aging models in a deduplication pipeline also seems to be a totally unexplored field of research. In the last stage of our project we aim at developing data aging models based on ML algorithms.</p>
      </sec>
      <sec id="sec-3-5">
        <title>3.5. Working with experts</title>
        <p>While designing the deduplication pipeline and evaluating its results, we have benefited from the help of experts. Their knowledge was used to determine the initial set of attributes used for comparing records and to choose similarity thresholds. The pipeline was tuned in an iterative way, each time based on the input from the experts evaluating the obtained results.</p>
        <p>In particular, grouping clients into cliques turned out to be very useful: it provided a holistic view of customers' representations and allowed us to clearly identify duplicates. In general, the proposed deduplication approach allowed for the proper implementation of the FI business goals.</p>
      </sec>
      <p>Acknowledgements. The work of Mariusz Sienkiewicz is supported by the Applied Doctorate grant no. DWD/4/24/2020 from the Polish Ministry of Education and Science.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <title>References</title>
      <ref id="r1"><label>1</label><mixed-citation>M. Sienkiewicz, R. Wrembel, Managing data in a big financial institution: Conclusions from a R&amp;D project, in: Proc. of the Workshops of the EDBT/ICDT 2021 Joint Conference, volume 2841 of CEUR Workshop Proceedings, CEUR-WS.org, 2021.</mixed-citation></ref>
      <ref id="r2"><label>2</label><mixed-citation>A. K. Elmagarmid, P. G. Ipeirotis, V. S. Verykios, Duplicate record detection: A survey, IEEE Trans. Knowl. Data Eng. 19 (2007) 1–16.</mixed-citation></ref>
      <ref id="r3"><label>3</label><mixed-citation>H. Köpcke, E. Rahm, Frameworks for entity matching: A comparison, Data &amp; Knowledge Engineering 69 (2010) 197–210.</mixed-citation></ref>
      <ref id="r4"><label>4</label><mixed-citation>G. Papadakis, L. Tsekouras, E. Thanos, G. Giannakopoulos, T. Palpanas, M. Koubarakis, Domain- and structure-agnostic end-to-end entity resolution with jedai, SIGMOD Record 48 (2019) 30–36.</mixed-citation></ref>
      <ref id="r5"><label>5</label><mixed-citation>A. Colyer, The morning paper on An overview of end-to-end entity resolution for big data, https://blog.acolyer.org/2020/12/14/entity-resolution/, 2020.</mixed-citation></ref>
      <ref id="r6"><label>6</label><mixed-citation>M. Bilenko, B. Kamath, R. J. Mooney, Adaptive blocking: Learning to scale up record linkage, in: IEEE Int. Conf. on Data Mining (ICDM), IEEE Computer Society, 2006, pp. 87–96.</mixed-citation></ref>
      <ref id="r7"><label>7</label><mixed-citation>L. de Souza Silva, F. Murai, A. P. C. da Silva, M. M. Moro, Automatic identification of best attributes for indexing in data deduplication, in: A. Mendelzon Int. Workshop on Foundations of Data Management, volume 2100 of CEUR Workshop Proceedings, CEUR-WS.org, 2018.</mixed-citation></ref>
      <ref id="r8"><label>8</label><mixed-citation>N. N. Dalvi, V. Rastogi, A. Dasgupta, A. D. Sarma, T. Sarlós, Optimal hashing schemes for entity matching, in: Int. World Wide Web Conf. WWW, 2013, pp. 295–306.</mixed-citation></ref>
      <ref id="r9"><label>9</label><mixed-citation>H. Kim, D. Lee, HARRA: fast iterative hashed record linkage for large-scale data collections, in: Int. Conf. on Extending Database Technology EDBT, volume 426, ACM, 2010, pp. 525–536.</mixed-citation></ref>
      <ref id="r10"><label>10</label><mixed-citation>M. A. Hernández, S. J. Stolfo, The merge/purge problem for large databases, in: ACM SIGMOD Int. Conf. on Management of Data, ACM Press, 1995, pp. 127–138.</mixed-citation></ref>
      <ref id="r11"><label>11</label><mixed-citation>B. Ramadan, P. Christen, H. Liang, R. W. Gayler, Dynamic sorted neighborhood indexing for real-time entity resolution, ACM Journal of Data and Information Quality 6 (2015) 15:1–15:29.</mixed-citation></ref>
      <ref id="r12"><label>12</label><mixed-citation>P. Christen, Data Matching - Concepts and Techniques for Record Linkage, Entity Resolution, and Duplicate Detection, Data-Centric Systems and Applications, Springer, 2012.</mixed-citation></ref>
      <ref id="r13"><label>13</label><mixed-citation>F. Naumann, Similarity measures, Hasso Plattner Institut, 2013.</mixed-citation></ref>
      <ref id="r14"><label>14</label><mixed-citation>M. Alamuri, B. R. Surampudi, A. Negi, A survey of distance/similarity measures for categorical data, in: Int. Joint Conf. on Neural Networks (IJCNN), IEEE, 2014, pp. 1907–1914.</mixed-citation></ref>
      <ref id="r15"><label>15</label><mixed-citation>S. Boriah, V. Chandola, V. Kumar, Similarity measures for categorical data: A comparative evaluation, in: SIAM Int. Conf. on Data Mining (SDM), SIAM, 2008, pp. 243–254.</mixed-citation></ref>
      <ref id="r16"><label>16</label><mixed-citation>P. Christen, A comparison of personal name matching: Techniques and practical issues, in: Int. Conf. on Data Mining (ICDM), IEEE Computer Society, 2006, pp. 290–294.</mixed-citation></ref>
      <ref id="r17"><label>17</label><mixed-citation>S. Sarawagi, A. Bhamidipaty, Interactive deduplication using active learning, in: ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining, ACM, 2002, pp. 269–278.</mixed-citation></ref>
      <ref id="r18"><label>18</label><mixed-citation>Y. Ma, T. Tran, Typimatch: type-specific unsupervised learning of keys and key values for heterogeneous web data integration, in: ACM Int. Conf. on Web Search and Data Mining (WSDM), ACM, 2013, pp. 325–334.</mixed-citation></ref>
      <ref id="r19"><label>19</label><mixed-citation>R. Carraghan, P. M. Pardalos, An exact algorithm for the maximum clique problem, Operations Research Letters 9 (1990) 375–382.</mixed-citation></ref>
      <ref id="r20"><label>20</label><mixed-citation>D. R. Wood, An algorithm for finding a maximum clique in a graph, Operations Research Letters 21 (1997) 211–217.</mixed-citation></ref>
      <ref id="r21"><label>21</label><mixed-citation>C. Bron, J. Kerbosch, Finding all cliques of an undirected graph (algorithm 457), Communications of the ACM 16 (1973) 575–576.</mixed-citation></ref>
      <ref id="r22"><label>22</label><mixed-citation>V. Christophides, V. Efthymiou, T. Palpanas, G. Papadakis, K. Stefanidis, An overview of end-to-end entity resolution for big data, ACM Computing Surveys 53 (2021) 127:1–127:42.</mixed-citation></ref>
      <ref id="r23"><label>23</label><mixed-citation>G. Papadakis, D. Skoutas, E. Thanos, T. Palpanas, Blocking and filtering techniques for entity resolution: A survey, ACM Computing Surveys 53 (2020) 31:1–31:42.</mixed-citation></ref>
      <ref id="r24"><label>24</label><mixed-citation>W. W. Cohen, J. Richman, Learning to match and cluster large high-dimensional data sets for data integration, in: ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining, ACM, 2002, pp. 475–480.</mixed-citation></ref>
      <ref id="r25"><label>25</label><mixed-citation>M. Kejriwal, D. P. Miranker, An unsupervised algorithm for learning blocking schemes, in: IEEE Int. Conf. on Data Mining, IEEE Computer Society, 2013, pp. 340–349.</mixed-citation></ref>
      <ref id="r26"><label>26</label><mixed-citation>W. Shen, X. Li, A. Doan, Constraint-based entity matching, in: Nat. Conf. on Artificial Intelligence and Innovative Applications of Artificial Intelligence Conf., AAAI Press / The MIT Press, 2005, pp. 862–867.</mixed-citation></ref>
      <ref id="r27"><label>27</label><mixed-citation>M. A. Hernández, S. J. Stolfo, Real-world data is dirty: Data cleansing and the merge/purge problem, Data Mining and Knowledge Discovery 2 (1998) 9–37.</mixed-citation></ref>
      <ref id="r28"><label>28</label><mixed-citation>L. O. Evangelista, E. Cortez, A. S. da Silva, W. M. Jr., Adaptive and flexible blocking for record linkage tasks, Journal of Information and Data Management 1 (2010) 167–182.</mixed-citation></ref>
      <ref id="r29"><label>29</label><mixed-citation>M. Michelson, C. A. Knoblock, Learning blocking schemes for record linkage, in: Nat. Conf. on Artificial Intelligence and Innovative Applications of Artificial Intelligence Conf., AAAI Press, 2006, pp. 440–445.</mixed-citation></ref>
      <ref id="r30"><label>30</label><mixed-citation>M. Bilenko, R. J. Mooney, Adaptive duplicate detection using learnable string similarity measures, in: ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining, ACM, 2003, pp. 39–48.</mixed-citation></ref>
      <ref id="r31"><label>31</label><mixed-citation>S. Mudgal, H. Li, T. Rekatsinas, A. Doan, Y. Park, G. Krishnan, R. Deep, E. Arcaute, V. Raghavendra, Deep learning for entity matching: A design space exploration, in: SIGMOD Int. Conf. on Management of Data, ACM, 2018, pp. 19–34.</mixed-citation></ref>
      <ref id="r32"><label>32</label><mixed-citation>M. Paganelli, F. D. Buono, M. Pevarello, F. Guerra, M. Vincini, Automated machine learning for entity matching tasks, in: Int. Conf. on Extending Database Technology EDBT, OpenProceedings.org, 2021, pp. 325–330.</mixed-citation></ref>
      <ref id="r33"><label>33</label><mixed-citation>I. Bhattacharya, L. Getoor, A latent dirichlet model for unsupervised entity resolution, in: SIAM Int. Conf. on Data Mining, SIAM, 2006, pp. 47–58.</mixed-citation></ref>
      <ref id="r34"><label>34</label><mixed-citation>M. Gheini, M. Kejriwal, Unsupervised product entity resolution using graph representation learning, in: SIGIR Workshop on eCommerce @ ACM SIGIR Int. Conf. on Research and Development in Information Retrieval, volume 2410, CEUR-WS.org, 2019.</mixed-citation></ref>
      <ref id="r35"><label>35</label><mixed-citation>U. Brunner, K. Stockinger, Entity matching on unstructured data: An active learning approach, in: Swiss Conf. on Data Science SDS, IEEE, 2019, pp. 97–102.</mixed-citation></ref>
      <ref id="r36"><label>36</label><mixed-citation>X. Chen, Y. Xu, D. Broneske, G. C. Durand, R. Zoun, G. Saake, Heterogeneous committee-based active learning for entity resolution (healer), in: European Conf. on Advances in Databases and Information Systems ADBIS, volume 11695 of LNCS, Springer, 2019, pp. 69–85.</mixed-citation></ref>
      <ref id="r37"><label>37</label><mixed-citation>A. Jain, S. Sarawagi, P. Sen, Deep indexed active learning for matching heterogeneous entity representations, VLDB Endowment 15 (2021) 31–45.</mixed-citation></ref>
      <ref id="r38"><label>38</label><mixed-citation>V. V. Meduri, L. Popa, P. Sen, M. Sarwat, A comprehensive benchmark framework for active learning methods in entity matching, in: SIGMOD Int. Conf. on Management of Data, ACM, 2020, pp. 1133–1147.</mixed-citation></ref>
      <ref id="r39"><label>39</label><mixed-citation>M. Sariyar, A. Borg, K. Pommerening, Active learning strategies for the deduplication of electronic patient data using classification trees, Journal of Biomedical Informatics 45 (2012) 893–900.</mixed-citation></ref>
      <ref id="r40"><label>40</label><mixed-citation>P. Nodet, V. Lemaire, A. Bondu, A. Cornuéjols, A. Ouorou, From weakly supervised learning to biquality learning: an introduction, in: Int. Joint Conf. on Neural Networks (IJCNN), IEEE, 2021, pp. 1–10.</mixed-citation></ref>
      <ref id="r41"><label>41</label><mixed-citation>A. Zakrzewska, D. A. Bader, Aging data in dynamic graphs: A comparative study, in: Int. Conf. on Advances in Social Networks Analysis and Mining, ASONAM, IEEE Computer Society, 2016, pp. 1055–1062.</mixed-citation></ref>
    </ref-list>
  </back>
</article>