On Customer Data Deduplication: Lessons Learned from an R&D Project in the Financial Sector

Paweł Boiński¹, Mariusz Sienkiewicz¹, Bartosz Bębel¹, Robert Wrembel¹, Dariusz Gałęzowski² and Waldemar Graniszewski³

¹ Poznan University of Technology, Poznań, Poland
² DAMA Poland, Warsaw, Poland
³ Warsaw University of Technology, Warsaw, Poland


Abstract
Despite the fact that financial institutions (FIs) apply data governance strategies and use the most advanced state-of-the-art data management and data engineering software and systems to support their day-to-day businesses, their databases are not free from faulty (dirty and duplicated) data. In this paper, we report some conclusions from an ongoing research and development project for an FI. The goal of this project is to integrate customers’ data from multiple data sources: to clean, homogenize, and deduplicate them. In particular, this paper focuses on findings from developing the customer data deduplication process.

Keywords
data quality, data cleaning, data deduplication pipeline



Published in the Workshop Proceedings of the EDBT/ICDT 2022 Joint Conference (March 29-April 1, 2022), Edinburgh, UK
Email: robert.wrembel@cs.put.poznan.pl (R. Wrembel); ORCID: 0000-0001-6037-5718 (R. Wrembel)
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073, http://ceur-ws.org


1. Introduction

Financial institutions (FIs) apply data governance strategies and use the most advanced state-of-the-art data management and data engineering software to manage the data collected by their day-to-day businesses. Unfortunately, the application of advanced technologies does not prevent them from collecting and storing some faulty data, mainly erroneous, outdated, and duplicated data, e.g., [1]. Such data mainly concern customers, both individuals and institutions.

Duplicated and outdated data cause economic losses, increase customer dissatisfaction, and deteriorate the reputation of an FI. For these reasons, the integration, cleaning, and deduplication of customers’ data is one of the processes in data governance.

In the research literature, a base-line data deduplication pipeline has been proposed, e.g., [2, 3, 4]. It has become a standard pipeline for multiple data deduplication projects. The pipeline includes four basic tasks, namely: (1) blocking (a.k.a. indexing), which arranges records into groups, such that each group is likely to include duplicates, (2) block processing (a.k.a. filtering), which eliminates records that do not have to be compared, (3) entity matching (a.k.a. similarity computation), which computes similarity values between record pairs, and (4) entity clustering, which creates larger clusters of similar records.

In this paper, we outline our experience and findings from designing a deduplication pipeline for customers’ data (Section 2). We discuss the approaches that are possible for each task in the pipeline and present the particular solutions that proved adequate for the addressed problem. Final conclusions are presented in Section 3. Notice that this paper presents findings from a real R&D project, and therefore not all details can be revealed, as they are treated as company know-how.


2. Deduplication pipeline in the project

Inspired by the aforementioned base-line data deduplication pipeline (BLDDP), in the described project we apply an adjusted pipeline that suits the goals of our project. There are three basic differences between our pipeline and the BLDDP. First, in our pipeline we explicitly included all steps that we found to be crucial for the deduplication process on customers’ data, whereas in the BLDDP some steps are implicit. Second, the last task in our pipeline allows some groups of similar records (cliques) to be merged further, whereas the BLDDP, to the best of our knowledge, does not include this task. Third, our pipeline accepts dirty customer data, whereas the BLDDP assumes that input data were cleaned beforehand.

Our pipeline includes the following tasks (cf. Figure 1), which are outlined in the remainder of the paper: [T1] selecting grouping attributes, [T2] selecting attributes used to compare record pairs, [T3] choosing a method for comparing records, [T4] selecting similarity measures for comparing values of attribute pairs, [T5] defining weights of attributes to compute record similarities and choosing similarity thresholds, [T6] building pairs of records having a high similarity value, [T7] building cliques of similar records, [T8] further merging cliques of similar records. [T1] realizes blocking in the BLDDP; [T3] realizes block processing and entity matching; [T2], [T4], and [T5] realize entity matching; [T6] and [T7] realize entity clustering.
2.1. Pipeline implementation environments

Tasks [T1] to [T6] were implemented in parallel in two alternative environments. The first one is a typical data engineering environment, based on the Oracle DBMS as data storage and the PL/SQL programming language for implementing the deduplication pipeline; this environment is the standard one in the FI running the project. The second environment is a typical data science environment, based on CSV files as data storage and Python (Anaconda or JupyterLab, with data science packages) for implementing the pipeline.

2.2. Tasks in the pipeline

2.2.1. T1: Grouping attributes

The main challenge in grouping records is to select a set of grouping attributes that allows the highest number of potentially duplicate records to be identified. Even though 14 different blocking methods have been proposed in the research literature [5], there is no single universal grouping method suitable for all application domains [6].

Inspired by [7], we proposed a method based on statistical characteristics of customer attributes: (1) the ratio of the number of 𝑛𝑢𝑙𝑙 values to the number of 𝑛𝑜𝑡𝑛𝑢𝑙𝑙 values of an attribute, (2) the ratio of the number of 𝑑𝑢𝑝𝑙𝑖𝑐𝑎𝑡𝑒𝑑 values to the number of 𝑛𝑜𝑡𝑛𝑢𝑙𝑙 values of an attribute, (3) the ratio of the number of 𝑛𝑜𝑡𝑑𝑢𝑝𝑙𝑖𝑐𝑎𝑡𝑒𝑑 values to the number of 𝑛𝑜𝑡𝑛𝑢𝑙𝑙 values of an attribute, (4) the value of (𝑛𝑜𝑡𝑛𝑢𝑙𝑙 − 𝑑𝑖𝑠𝑡𝑖𝑛𝑐𝑡)/𝑛𝑜𝑡𝑛𝑢𝑙𝑙 for an attribute. These characteristics are computed for every attribute that is a candidate for grouping. Additionally, (1) the diversity of the values of each attribute is modeled by means of the Gini index and (2) the size of a record group is penalized by means of a quadratic function with a negative value of the coefficient a.

Notice that the initial set of potential grouping attributes was selected based on expert knowledge. It included 20 attributes. Next, the candidate attributes were processed by means of our method and their statistical characteristics were computed. The obtained ranking of attributes was verified by domain experts. Based on their input, the final set of attributes was selected for arranging (grouping) records.
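To illustrate how such per-attribute statistics can be computed in the data science environment (cf. Section 2.1), the sketch below derives the null, duplicate, and distinct ratios and a Gini index for every candidate attribute with pandas. It is a minimal illustration, not the project’s implementation: the attribute names are hypothetical, the exact counting of duplicated values is one possible interpretation of the description above, and the quadratic group-size penalty is omitted because its coefficient is part of the company know-how.

```python
import pandas as pd

def attribute_statistics(df: pd.DataFrame) -> pd.DataFrame:
    """Per-attribute statistics used to rank candidate grouping attributes."""
    rows = []
    for col in df.columns:
        values = df[col]
        n_notnull = int(values.notna().sum())
        n_null = len(values) - n_notnull
        n_distinct = int(values.nunique(dropna=True))
        counts = values.value_counts(dropna=True)
        # occurrences of values that appear more than once (one reading of "duplicated")
        n_duplicated = int(counts[counts > 1].sum())
        # Gini index of the value distribution: 1 - sum(p_i^2); higher = more diverse
        probs = counts / max(n_notnull, 1)
        gini = 1.0 - float((probs ** 2).sum())
        rows.append({
            "attribute": col,
            "null_to_notnull": n_null / max(n_notnull, 1),
            "duplicated_to_notnull": n_duplicated / max(n_notnull, 1),
            "notduplicated_to_notnull": (n_notnull - n_duplicated) / max(n_notnull, 1),
            "(notnull-distinct)/notnull": (n_notnull - n_distinct) / max(n_notnull, 1),
            "gini_index": gini,
        })
    return pd.DataFrame(rows).set_index("attribute")

# Hypothetical usage on a customer table with made-up attribute names:
# customers = pd.read_csv("customers.csv")
# print(attribute_statistics(customers[["last_name", "city", "street", "birth_date"]]))
```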
2.2.2. T2: Attributes for comparing record pairs

Having ordered records by the attributes from the ranking obtained in task [T1], the next task is to select the attributes whose values will be compared in record pairs, in order to compute the similarity of the records in each pair. Potential candidates for comparison are attributes that: (1) are record identifiers, (2) do not include nulls, (3) include cleaned values, e.g., no typos and no additional erroneous characters, (4) include unified (homogenized) values, e.g., no abbreviations and the same acronyms used throughout the whole data set.

Unfortunately, in real cases such attributes often do not exist. As far as record identifiers are concerned, natural identifiers are typically used in FI applications, but their values are frequently artificially generated (in cases when natural identifiers cannot be used, i.e., a customer is not able to provide one). Thus, in some cases, artificially generated IDs may have the same values as natural ones. Notice that the financial sector is strictly regulated by means of European law, national law, and recommendations issued by the institutions controlling the sector. As a consequence, procedures aiming at improving the quality of data in this sector are strictly controlled. For this reason, the possibilities of applying data cleaning processes are limited.

In the described project, the set of attributes selected for comparing record pairs is based on the aforementioned preferable attribute characteristics and on expert knowledge. The set includes 18 attributes describing individual customers (e.g., personal data and address components) and 24 attributes describing institutional customers (e.g., institution names, addresses, and the type of business run).

2.2.3. T3: A method for comparing records

Based on the ranking of grouping attributes obtained in task [T1] (cf. Section 2.2.1), records need to be arranged into groups. Next, in each group, records are compared in pairs. The literature proposes two popular techniques for grouping, namely: hashing, e.g., [8, 9], and sorting, known as the sorted neighborhood method, e.g., [10, 11].

The sorted neighborhood method accepts one parameter, namely the size of the sliding window in which records are compared. The larger the window size, the more potential duplicates can be found, but the longer the record comparison time, since more records have to be compared each time. Experimental evaluations of the window size reported in the literature discuss typical sizes ranging from 2 to 60 records [10].

The sorted neighborhood method is intuitive, has an acceptable computational complexity, and is available in one of the Python libraries. For this reason, it was applied in the project. We ran a series of experiments in order to determine the best window size. Our experiments showed that the size has to be adjusted experimentally for a particular data set being deduplicated. For comparing individual customers’ records we used a window of 20 records, whereas for comparing institutional customers we used a variable window size with a maximum of 200 records.
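As an illustration of this task, the following sketch uses the Python recordlinkage package, which provides a sorted neighborhood indexer and attribute-level comparison. It is not the project’s code: the customer table and its attributes are made up, and the toy window size has to be tuned per data set (as discussed above, the project used a window of 20 records for individual customers and a variable window of up to 200 records for institutional ones).

```python
import pandas as pd
import recordlinkage

# A made-up customer table; real attribute names and values are project-specific.
customers = pd.DataFrame({
    "last_name": ["Kowalski", "Kowalsky", "Nowak", "Nowak"],
    "city":      ["Poznan",   "Poznan",   "Warsaw", "Warszawa"],
})

# Sorted neighborhood indexing: records are sorted on the blocking key and only
# records falling into the same sliding window become candidate pairs.
indexer = recordlinkage.Index()
indexer.sortedneighbourhood("last_name", window=3)   # toy window size
candidate_pairs = indexer.index(customers)

# Attribute-level similarity computation for the candidate pairs.
compare = recordlinkage.Compare()
compare.string("last_name", "last_name", method="jarowinkler", label="last_name")
compare.string("city", "city", method="jarowinkler", label="city")
similarities = compare.compute(candidate_pairs, customers)
print(similarities)
```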
2.2.4. T4: Similarity measures for text attributes

The literature on data deduplication and similarity measures lists well over 30 different similarity measures for text data, e.g., [12, 13]. One may find in the literature suggestions, supported by experimental evaluations, on the applicability of different measures to different kinds of text data, e.g., [14, 15, 16, 12].

In our project, we evaluated 44 measures available in Python packages. Some measures, e.g., Levenshtein, Jaro, and Jaro-Winkler, exist in a few different implementations (packages), so we evaluated these implementations as well. The evaluation was run on three different real data sets, i.e.: (1) customers’ last names with an average length of 10.9 characters, (2) street names with an average length of 16 characters, and (3) institution names with an average length of 45.5 characters. Customers’ names included 98% one-word names and 2% two-word names, which reflected the real distribution of such types of names in our customer population. All test data represented true positives, but with the typical real errors found by data profiling.

From the evaluation we drew the following conclusions:

    • for short strings, like last names and street names (composed of 7 to 28 characters), the Overlap, Jaro-Winkler, and StrCmp95 similarity measures gave the highest similarity values;
    • for long strings, like institution names (composed of 46 to 116 characters and up to 12 separate words), the Overlap, Sorensen, and StrCmp95 measures gave the highest similarity values; therefore, they were recommended for comparing such kinds of data.
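For illustration, the sketch below compares two made-up value pairs (a short, last-name-like pair and a long, institution-name-like pair) with measures of the kind named above, using the textdistance package. This is only one of several Python packages providing such measures and is not necessarily the set of implementations evaluated in the project; the example strings are invented.

```python
import textdistance

# A short pair (last-name-like) and a long pair (institution-name-like),
# each with a typical typo; both pairs are made-up examples.
short_pair = ("Kowalski", "Kowalsky")
long_pair  = ("Przedsiebiorstwo Handlowo-Uslugowe ALFA Sp. z o.o.",
              "Przedsieb. Handlowo Uslugowe ALFA Sp z o.o.")

measures = {
    "overlap":      textdistance.overlap,
    "jaro_winkler": textdistance.jaro_winkler,
    "sorensen":     textdistance.sorensen,
    "strcmp95":     textdistance.strcmp95,
    "levenshtein":  textdistance.levenshtein,
}

for name, measure in measures.items():
    # normalized_similarity() maps every measure to the [0, 1] range,
    # which makes the values comparable across measures.
    print(f"{name:12s} short={measure.normalized_similarity(*short_pair):.3f} "
          f"long={measure.normalized_similarity(*long_pair):.3f}")
```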
2.2.5. T5: Attribute weights and similarity thresholds

In task [T5], we applied an iterative process of tuning the weights of the attributes used to compute records’ similarity, with the support of domain experts. Additionally, rules had to be defined to decide whether to compare the values of a given attribute. Let us assume that a pair of records 𝑟𝑚 and 𝑟𝑛 is compared to compute their similarity value. Some cases handled by the rules include:

    • when 𝑟𝑚 has a defined value of attribute 𝐴 and the value of 𝐴 of 𝑟𝑛 is null, then the values of 𝐴 cannot be compared; in this case, 𝐴 is not considered for computing the record similarity of pair (𝑟𝑚, 𝑟𝑛);
    • when record IDs are compared but the value of the ID of 𝑟𝑚 is artificially generated and the value of the ID of 𝑟𝑛 is real, then such values cannot be compared; in this case, the ID is not considered for computing the record similarity of pair (𝑟𝑚, 𝑟𝑛);
    • if the IDs can be compared (i.e., they include real values), then a binary similarity value of either 1 (the IDs are equal) or 0 (the IDs are not equal) is assigned to the compared pair of IDs.

On top of these rules, for institutional customers we used equal weights for each attribute being the subject of comparison, whereas for individual customers higher weights were set for the ID and last name attributes. These weights were set based on an iterative experimentation process and evaluation of the results by domain experts.

Based on the weighted values of the similarities between pairs of individual attributes of records 𝑟𝑚 and 𝑟𝑛, a total similarity of (𝑟𝑚, 𝑟𝑛) was computed. Let us denote it as 𝑟𝑠𝑖𝑚. Based on its value, a given pair of records was classified as similar (a match), non-similar (a non-match), or undecided. For this kind of classification, so-called similarity thresholds had to be defined. In practice, these thresholds are defined based on the analysis of the obtained record pairs and on the knowledge of domain experts [12, 17].

In our project we applied the same approach. Based on the knowledge of the FI experts, the lowest value of 𝑟𝑠𝑖𝑚 for similar records was set to 0.8 and the highest value for non-similar records was set to 0.6.
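The following sketch shows, under simplifying assumptions, how the weighted total similarity 𝑟𝑠𝑖𝑚 and the three-way classification can be computed. The attribute names and weights are hypothetical placeholders (in the project they were tuned iteratively with domain experts), while the 0.8 and 0.6 thresholds come from the text above.

```python
from typing import Dict, Optional

# Hypothetical per-attribute weights (higher for ID and last name, as for
# individual customers in the project).
WEIGHTS = {"customer_id": 3.0, "last_name": 2.0, "first_name": 1.0, "city": 1.0}

SIMILAR_THRESHOLD = 0.8      # r_sim >= 0.8 -> match
NON_SIMILAR_THRESHOLD = 0.6  # r_sim <= 0.6 -> non-match; otherwise undecided

def record_similarity(attr_sims: Dict[str, Optional[float]]) -> float:
    """Weighted mean of attribute similarities; attributes whose values could
    not be compared (e.g., a null value or an artificial ID) are passed as
    None and excluded, following the rules sketched in Section 2.2.5."""
    used = {a: s for a, s in attr_sims.items() if s is not None}
    total_weight = sum(WEIGHTS[a] for a in used)
    if total_weight == 0:
        return 0.0
    return sum(WEIGHTS[a] * s for a, s in used.items()) / total_weight

def classify(r_sim: float) -> str:
    if r_sim >= SIMILAR_THRESHOLD:
        return "match"
    if r_sim <= NON_SIMILAR_THRESHOLD:
        return "non-match"
    return "undecided"

# Example: the ID of one record is artificially generated, so it is skipped (None).
sims = {"customer_id": None, "last_name": 0.95, "first_name": 0.90, "city": 0.70}
r_sim = record_similarity(sims)
print(r_sim, classify(r_sim))
```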
2.2.6. T6: Building pairs of similar records

The sorted neighborhood method produces pairs of similar records with: (1) their overall value of 𝑟𝑠𝑖𝑚 and (2) the similarity values of each attribute being compared. These data are stored in a repository (cf. Section 2.1) and visualized in a spreadsheet for expert verification.

2.2.7. T7: Building cliques of similar records

Since similar record pairs may form larger sets, to find such sets all similar pairs have to be combined into a graph, with records represented as nodes and labeled edges representing the similarities between records. In such a graph, a group of similar records forms a maximal clique. Thus, the problem of finding sets of similar records transforms into the problem of finding maximal cliques in a graph. In general, this is an NP-hard problem [18]. The problem becomes computationally less expensive for sparse graphs, e.g., [19, 20], which is the case for a graph created from similar records. A few fast algorithms have been developed for finding maximal cliques.

One of them is the Bron-Kerbosch algorithm [21], which we decided to use for the following reasons. First, it is frequently used in the community working on graph processing. Second, it is implemented in multiple programming languages, including Python. Third, its worst-case computational complexity is 𝑂(3^(𝑁/3)), where 𝑁 denotes the number of graph nodes. For sparse graphs the complexity is lower.

The algorithm was used for finding cliques in a graph composed of 2,228,580 customer nodes. This evaluation confirmed its efficiency in terms of processing time and its applicability to the deduplication problem (as confirmed by the experts from the FI).
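As an illustration, the NetworkX library exposes this algorithm through nx.find_cliques(), which enumerates maximal cliques using a Bron-Kerbosch variant. The graph below is a toy example (the project’s graph contained about 2.2 million customer nodes), and the record IDs and similarity values are made up.

```python
import networkx as nx

# Build a similarity graph: nodes are record IDs, edges connect record pairs
# classified as similar, with r_sim kept as an edge label.
similar_pairs = [
    (1, 2, 0.93), (2, 3, 0.88), (1, 3, 0.85),   # records 1-3 are mutually similar
    (4, 5, 0.91),                               # a separate pair
]
G = nx.Graph()
G.add_weighted_edges_from(similar_pairs, weight="r_sim")

# nx.find_cliques() enumerates maximal cliques (Bron-Kerbosch with pivoting);
# each clique is one group of duplicate candidates.
cliques = [sorted(c) for c in nx.find_cliques(G)]
print(cliques)   # e.g., [[1, 2, 3], [4, 5]]
```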
2.2.8. T8: Merging cliques of similar records

If the number of similar records is larger than the size of the sliding window in the sorted neighborhood method, then a few cliques are created and all of them contain records that are similar to each other. Therefore, the final step is to merge cliques that share a certain number of common records (currently, the Jaccard coefficient is used to decide which cliques to merge). We are also experimenting with a variable, automatically adjustable window size.
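A minimal sketch of this merging step is shown below. The Jaccard threshold of 0.5 is a hypothetical placeholder, since the value actually used in the project is not disclosed, and the greedy merging loop is only one simple way to apply the criterion.

```python
from typing import List, Set

def jaccard(a: Set[int], b: Set[int]) -> float:
    """Jaccard coefficient of two cliques (sets of record IDs)."""
    return len(a & b) / len(a | b)

def merge_cliques(cliques: List[Set[int]], threshold: float = 0.5) -> List[Set[int]]:
    """Greedily merge cliques whose Jaccard coefficient reaches the threshold."""
    merged = [set(c) for c in cliques]
    changed = True
    while changed:
        changed = False
        for i in range(len(merged)):
            for j in range(i + 1, len(merged)):
                if jaccard(merged[i], merged[j]) >= threshold:
                    merged[i] |= merged[j]
                    del merged[j]
                    changed = True
                    break
            if changed:
                break
    return merged

print(merge_cliques([{1, 2, 3}, {2, 3, 4}, {7, 8}]))   # -> [{1, 2, 3, 4}, {7, 8}]
```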
3. Final observations

In this paper, we reported our experience from an R&D project for an FI on deduplicating customers’ data. The project is ongoing (two out of four stages have already been realized). In the project, we adapted the standard deduplication pipeline from the literature to the particular characteristics of the data being deduplicated and to the project requirements. The whole pipeline was implemented and verified by domain experts. The results obtained so far have been accepted by the FI.

It must be stressed that the reality of the discussed project differs from the one assumed in the research literature with respect to: (1) the assumption on the cleanness of the data being deduplicated, (2) the sizes of the deduplicated data sets, (3) the availability of tagged data for ML algorithms, and (4) the neglect of the data aging process. None of these assumptions holds in our project, as outlined in the following sections.

3.1. Data cleanness

The base-line data deduplication pipeline [22, 2, 3, 23, 4] assumes that the data delivered to the pipeline are clean (e.g., no null values, no spelling errors, homogenized full names and abbreviations). Unfortunately, in real projects this assumption cannot be guaranteed, especially in the financial sector. There exist typos, missing values, and inconsistent values in attributes storing personal data, institution names, and addresses. Moreover, not all natural IDs are reliable. By regulation, even known dirty customer data cannot be cleaned without the explicit permission of a customer. Getting such permissions from millions of customers in a finite time frame is impossible. For this reason, only simple cleaning is possible, like removing leading or trailing erroneous characters from customers’ addresses. As a consequence, in practice the deduplication pipeline has to be applied to data that have undergone only basic cleaning.

3.2. Data size

Most of the methods used in the base-line data deduplication pipeline were verified either on small real data sets, e.g., bibliographical data with 32,000 records [6, 24, 25, 17, 26], restaurants with 500 records [24, 25], movies with 5,000 records [26], or patients with 128,000 records [27], or on artificially generated data sets [6, 7].

In this paper, by contrast, we reported our experience on deduplicating customers’ data of much larger volumes, i.e.: (1) 2,228,580 records describing individual customers and (2) 1,185,290 records describing institutional customers. The final goal of the reported project is to apply the developed pipeline and techniques to a database storing more than 11 million customer records (since the project is ongoing, this stage will be run at the end of the project).

3.3. Tagged data for ML

Some tasks in the deduplication pipeline can be run with the support of machine learning (ML) techniques, e.g., blocking [28, 29], selecting similarity measures and thresholds [30], and matching similar records [31, 32]. If the pipeline applies ML to the entity matching task, it is assumed that there exists a set of training records tagged as true positives and true negatives. Unfortunately, in a large FI it is impossible to create such a set of training data because of the volume of data to be processed by the pipeline. For an original data set composed of several million customers, a training data set of a reasonable size should include at least several thousand tagged training records. In reality, such a large number of training records cannot be created by experts. For this reason, in practice, training data are frequently unavailable for ML algorithms.

In order to overcome this difficulty, unsupervised learning techniques are used, e.g., [33, 34]. Some publications report on applying active learning techniques to a deduplication process, e.g., [35, 36, 37, 38, 39], and this direction will also be investigated in the project. Currently, we are experimenting with weakly supervised learning [40] with the support of the snorkel library.
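To give an idea of this direction, the sketch below follows the Snorkel-style weak supervision pattern: instead of manually tagged pairs, labeling functions encode expert heuristics over candidate record pairs, and a label model combines their noisy votes into probabilistic training labels. It assumes the snorkel package (0.9-style API) and made-up pair features; it illustrates the general approach, not the project’s actual labeling setup.

```python
import pandas as pd
from snorkel.labeling import labeling_function, PandasLFApplier
from snorkel.labeling.model import LabelModel

ABSTAIN, NON_MATCH, MATCH = -1, 0, 1

# Hypothetical candidate pairs with precomputed attribute similarities.
pairs = pd.DataFrame({
    "last_name_sim": [0.97, 0.40, 0.85],
    "id_equal":      [1,    0,    0],
})

# Labeling functions encode expert heuristics instead of manual tagging.
@labeling_function()
def lf_same_id(row):
    return MATCH if row.id_equal == 1 else ABSTAIN

@labeling_function()
def lf_very_different_name(row):
    return NON_MATCH if row.last_name_sim < 0.5 else ABSTAIN

applier = PandasLFApplier(lfs=[lf_same_id, lf_very_different_name])
L = applier.apply(df=pairs)

# The label model combines the (noisy, partial) votes into probabilistic labels.
label_model = LabelModel(cardinality=2, verbose=False)
label_model.fit(L_train=L, n_epochs=100, seed=42)
print(label_model.predict(L))
```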
3.4. Data aging

An inherent feature of some types of data is their aging. For example, customers’ last names, identification documents, different types of postal addresses, and contact data (phone numbers, emails) have this feature. Outdated data impact the possibility of discovering duplicate records. For this reason, for a deduplication process it would be profitable to know which pieces of the compared data are likely to be outdated.

Building data aging models has not been researched so far (the only approach addressing a related problem is [41], but in the context of temporal data). Including aging models in a deduplication pipeline also seems to be a totally unexplored field of research. In the last stage of our project we aim at developing data aging models based on ML algorithms.

3.5. Working with experts

While designing the deduplication pipeline and evaluating its results, we have benefited from the help of experts. Their knowledge was used to determine the initial set of attributes used for comparing records and to choose the similarity thresholds. The pipeline was tuned in an iterative way, each time based on the input from the experts evaluating the obtained results.

In particular, grouping clients into cliques turned out to be very useful: it provided a holistic view of customers’ representations and allowed duplicates to be clearly identified. In general, the proposed deduplication approach allowed for the proper implementation of the FI’s business goals.

Acknowledgements. The work of Mariusz Sienkiewicz is supported by the Applied Doctorate grant no. DWD/4/24/2020 from the Polish Ministry of Education and Science.
References

 [1] M. Sienkiewicz, R. Wrembel, Managing data in a big financial institution: Conclusions from a R&D project, in: Proc. of the Workshops of the EDBT/ICDT 2021 Joint Conference, volume 2841 of CEUR Workshop Proceedings, CEUR-WS.org, 2021.
 [2] A. K. Elmagarmid, P. G. Ipeirotis, V. S. Verykios, Duplicate record detection: A survey, IEEE Trans. Knowl. Data Eng. 19 (2007) 1–16.
 [3] H. Köpcke, E. Rahm, Frameworks for entity matching: A comparison, Data & Knowledge Engineering 69 (2010) 197–210.
 [4] G. Papadakis, L. Tsekouras, E. Thanos, G. Giannakopoulos, T. Palpanas, M. Koubarakis, Domain- and structure-agnostic end-to-end entity resolution with jedai, SIGMOD Record 48 (2019) 30–36.
 [5] A. Colyer, The morning paper on An overview of end-to-end entity resolution for big data, https://blog.acolyer.org/2020/12/14/entity-resolution/, 2020.
 [6] M. Bilenko, B. Kamath, R. J. Mooney, Adaptive blocking: Learning to scale up record linkage, in: IEEE Int. Conf. on Data Mining (ICDM), IEEE Computer Society, 2006, pp. 87–96.
 [7] L. de Souza Silva, F. Murai, A. P. C. da Silva, M. M. Moro, Automatic identification of best attributes for indexing in data deduplication, in: A. Mendelzon Int. Workshop on Foundations of Data Management, volume 2100 of CEUR Workshop Proceedings, CEUR-WS.org, 2018.
 [8] N. N. Dalvi, V. Rastogi, A. Dasgupta, A. D. Sarma, T. Sarlós, Optimal hashing schemes for entity matching, in: Int. World Wide Web Conf. WWW, 2013, pp. 295–306.
 [9] H. Kim, D. Lee, HARRA: fast iterative hashed record linkage for large-scale data collections, in: Int. Conf. on Extending Database Technology EDBT, volume 426, ACM, 2010, pp. 525–536.
[10] M. A. Hernández, S. J. Stolfo, The merge/purge problem for large databases, in: ACM SIGMOD Int. Conf. on Management of Data, ACM Press, 1995, pp. 127–138.
[11] B. Ramadan, P. Christen, H. Liang, R. W. Gayler, Dynamic sorted neighborhood indexing for real-time entity resolution, ACM Journal of Data and Information Quality 6 (2015) 15:1–15:29.
[12] P. Christen, Data Matching - Concepts and Techniques for Record Linkage, Entity Resolution, and Duplicate Detection, Data-Centric Systems and Applications, Springer, 2012.
[13] F. Naumann, Similarity measures, Hasso Plattner Institut, 2013.
[14] M. Alamuri, B. R. Surampudi, A. Negi, A survey of distance/similarity measures for categorical data, in: Int. Joint Conf. on Neural Networks (IJCNN), IEEE, 2014, pp. 1907–1914.
[15] S. Boriah, V. Chandola, V. Kumar, Similarity measures for categorical data: A comparative evaluation, in: SIAM Int. Conf. on Data Mining (SDM), SIAM, 2008, pp. 243–254.
[16] P. Christen, A comparison of personal name matching: Techniques and practical issues, in: Int. Conf. on Data Mining (ICDM), IEEE Computer Society, 2006, pp. 290–294.
[17] S. Sarawagi, A. Bhamidipaty, Interactive deduplication using active learning, in: ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining, ACM, 2002, pp. 269–278.
[18] Y. Ma, T. Tran, Typimatch: type-specific unsupervised learning of keys and key values for heterogeneous web data integration, in: ACM Int. Conf. on Web Search and Data Mining (WSDM), ACM, 2013, pp. 325–334.
[19] R. Carraghan, P. M. Pardalos, An exact algorithm for the maximum clique problem, Operations Research Letters 9 (1990) 375–382.
[20] D. R. Wood, An algorithm for finding a maximum clique in a graph, Operations Research Letters 21 (1997) 211–217.
[21] C. Bron, J. Kerbosch, Finding all cliques of an undirected graph (algorithm 457), Communications of the ACM 16 (1973) 575–576.
[22] V. Christophides, V. Efthymiou, T. Palpanas, G. Papadakis, K. Stefanidis, An overview of end-to-end entity resolution for big data, ACM Computing Surveys 53 (2021) 127:1–127:42.
[23] G. Papadakis, D. Skoutas, E. Thanos, T. Palpanas, Blocking and filtering techniques for entity resolution: A survey, ACM Computing Surveys 53 (2020) 31:1–31:42.
[24] W. W. Cohen, J. Richman, Learning to match and cluster large high-dimensional data sets for data integration, in: ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining, ACM, 2002, pp. 475–480.
[25] M. Kejriwal, D. P. Miranker, An unsupervised algorithm for learning blocking schemes, in: IEEE Int. Conf. on Data Mining, IEEE Computer Society, 2013, pp. 340–349.
[26] W. Shen, X. Li, A. Doan, Constraint-based entity matching, in: Nat. Conf. on Artificial Intelligence and Innovative Applications of Artificial Intelligence Conf., AAAI Press / The MIT Press, 2005, pp. 862–867.
[27] M. A. Hernández, S. J. Stolfo, Real-world data is dirty: Data cleansing and the merge/purge problem, Data Mining and Knowledge Discovery 2 (1998) 9–37.
[28] L. O. Evangelista, E. Cortez, A. S. da Silva, W. M. Jr., Adaptive and flexible blocking for record linkage tasks, Journal of Information and Data Management 1 (2010) 167–182.
[29] M. Michelson, C. A. Knoblock, Learning blocking schemes for record linkage, in: Nat. Conf. on Artificial Intelligence and Innovative Applications of Artificial Intelligence Conf., AAAI Press, 2006, pp. 440–445.
[30] M. Bilenko, R. J. Mooney, Adaptive duplicate detection using learnable string similarity measures, in: ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining, ACM, 2003, pp. 39–48.
[31] S. Mudgal, H. Li, T. Rekatsinas, A. Doan, Y. Park, G. Krishnan, R. Deep, E. Arcaute, V. Raghavendra, Deep learning for entity matching: A design space exploration, in: SIGMOD Int. Conf. on Management of Data, ACM, 2018, pp. 19–34.
[32] M. Paganelli, F. D. Buono, M. Pevarello, F. Guerra, M. Vincini, Automated machine learning for entity matching tasks, in: Int. Conf. on Extending Database Technology EDBT, OpenProceedings.org, 2021, pp. 325–330.
[33] I. Bhattacharya, L. Getoor, A latent dirichlet model for unsupervised entity resolution, in: SIAM Int. Conf. on Data Mining, SIAM, 2006, pp. 47–58.
[34] M. Gheini, M. Kejriwal, Unsupervised product entity resolution using graph representation learning, in: SIGIR Workshop on eCommerce @ ACM SIGIR Int. Conf. on Research and Development in Information Retrieval, volume 2410, CEUR-WS.org, 2019.
[35] U. Brunner, K. Stockinger, Entity matching on unstructured data: An active learning approach, in: Swiss Conf. on Data Science SDS, IEEE, 2019, pp. 97–102.
[36] X. Chen, Y. Xu, D. Broneske, G. C. Durand, R. Zoun, G. Saake, Heterogeneous committee-based active learning for entity resolution (healer), in: European Conf. on Advances in Databases and Information Systems ADBIS, volume 11695 of LNCS, Springer, 2019, pp. 69–85.
[37] A. Jain, S. Sarawagi, P. Sen, Deep indexed active learning for matching heterogeneous entity representations, VLDB Endowment 15 (2021) 31–45.
[38] V. V. Meduri, L. Popa, P. Sen, M. Sarwat, A comprehensive benchmark framework for active learning methods in entity matching, in: SIGMOD Int. Conf. on Management of Data, ACM, 2020, pp. 1133–1147.
[39] M. Sariyar, A. Borg, K. Pommerening, Active learning strategies for the deduplication of electronic patient data using classification trees, Journal of Biomedical Informatics 45 (2012) 893–900.
[40] P. Nodet, V. Lemaire, A. Bondu, A. Cornuéjols, A. Ouorou, From weakly supervised learning to biquality learning: an introduction, in: Int. Joint Conf. on Neural Networks (IJCNN), IEEE, 2021, pp. 1–10.
[41] A. Zakrzewska, D. A. Bader, Aging data in dynamic graphs: A comparative study, in: Int. Conf. on Advances in Social Networks Analysis and Mining, ASONAM, IEEE Computer Society, 2016, pp. 1055–1062.