Statistical Modeling vs. Machine Learning for Deduplication of Customer Records (industrial paper)

Witold Andrzejewski1, Bartosz Bębel1, Paweł Boiński1, Justyna Kowalewska3, Agnieszka Marszałek3 and Robert Wrembel1,2

1 Poznan University of Technology, Poznań, Poland
2 Interdisciplinary Centre for Artificial Intelligence and Cybersecurity, Poznań, Poland
3 PKO BP, Warsaw, Poland


Abstract
Large companies typically face the problem of multiple database records describing the same physical object (a.k.a. duplicates). There are multiple sources of duplicates, e.g., using multiple but unsynchronized data repositories, applications that do not check for duplicates before inserting data into a repository, and data errors. This problem is of particular importance when dealing with personal data, e.g., in healthcare, banking, and insurance. To handle the problem of duplicates, a state-of-the-art data deduplication pipeline has been developed, equipped with multiple complex algorithms. A promising direction in data deduplication is machine learning. In this paper, we report our experience in researching and developing two approaches to the deduplication of customer records in a financial institution. These approaches are based on: (1) statistical modeling and (2) machine learning. In particular, this paper summarizes our findings from comparing these two approaches. The reported research was done within an R&D project for the biggest Polish bank, PKO BP.

Keywords
data quality, data deduplication, statistical modeling, machine learning, classification



1. Introduction

Large organizations and companies face the problem of duplicated data. This problem is frequent for companies that store customer data. Duplicated and outdated data cause economic losses, increase customer dissatisfaction, and deteriorate the reputation of a company. For these reasons, data integration, cleaning, and deduplication of customer records are among the core processes in data governance.

Data deduplication has been extensively researched in multiple research centers worldwide. The research resulted in a base-line data deduplication pipeline, e.g., [1, 2, 3]. It has become a standard pipeline for multiple data deduplication projects in various application domains. The pipeline includes four basic tasks, namely:

    • blocking (a.k.a. indexing), which arranges records into groups, such that each group is likely to include duplicates,
    • block processing (a.k.a. filtering), which eliminates records that do not have to be compared,
    • entity matching (a.k.a. similarity computation), which computes similarity values between record pairs, and
    • entity clustering, which creates larger clusters of similar records.

Each of these tasks is supported by multiple algorithms. Some of them apply very complex statistical models (SM), whereas others are based on standard machine learning (ML), including deep learning (DL). The ML techniques typically apply various classification models to divide records into classes of matches, probable matches, and non-matches, e.g., [4, 5, 6]. The DL techniques apply language models for both record blocking and record matching, e.g., [7, 8, 9, 10, 11].

Since there are two families of approaches to data deduplication, a natural question is which family offers more accurate deduplication models. In this paper, we outline our experience and findings on designing a deduplication pipeline for customer data. We designed two versions of the pipeline. The first one uses statistical modeling for discovering duplicates. The second one uses ML techniques. Both pipelines were developed within an R&D project for the biggest Polish bank, PKO BP (https://pkobp.pl). The pipelines were tested on a real data set of over 5 million customer records. To the best of our knowledge, this is the biggest deduplication experiment comparing SM and ML approaches reported in the research literature.

Published in the Proceedings of the Workshops of the EDBT/ICDT 2024 Joint Conference (March 25-28, 2024), Paestum, Italy
Contact: robert.wrembel@cs.put.poznan.pl (R. Wrembel)
ORCID: 0000-0001-9486-929X (W. Andrzejewski); 0000-0002-6426-3809 (B. Bębel); 0000-0003-4914-9394 (P. Boiński); 0000-0001-6037-5718 (R. Wrembel)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org)


2. Tasks in the SM pipeline

We built the SM pipeline as an extension of the base-line data deduplication pipeline (BLDDP), to serve the particular goals of the project.
Figure 1: The tasks in the deduplication pipeline based on statistical modeling: SMT1: choosing grouping attributes [block building]; SMT2: choosing a method for comparing records [block processing]; SMT3: choosing attributes for comparing record pairs [entity matching]; SMT4: choosing similarity measures for attributes [entity matching]; SMT5: finding attribute weights and similarity thresholds [entity matching]; SMT6: building pairs of similar records [entity clustering]; SMT7: building record similarity graph [entity clustering]; SMT8: building record similarity subgraphs [entity clustering]




First, our pipeline explicitly includes all steps that we found to be crucial for the deduplication process (whereas in the BLDDP, some steps are implicit). Second, the last two tasks in our pipeline extend the BLDDP and allow us to further merge groups of similar records into subgraphs. Third, our pipeline accepts partially dirty data, as in practice it is impossible to perfectly clean all data. The SM pipeline applied in the project is shown in Figure 1.
SMT1: In this task, grouping attributes are selected, which allow similar records to be co-located in the same group. In our project, initially 20 candidate grouping attributes were selected by domain experts. The attributes either allowed customers to be identified, or they were error-free and not null, or their values had not changed over time. The candidate attributes were further verified by means of a method that we developed. For each attribute, the method computed a value representing its fitness as a grouping attribute [12]. Finally, the obtained ranking of attributes was verified by domain experts again. Based on their input, the final set of grouping attributes was selected.
SMT2: Using the grouping attributes obtained from SMT1, records are arranged into groups via sorting. The method we used is very similar to the one described in [13]; however, multiple sortings were used.
SMT3: The goal of this task is to select attributes that will contribute to assessing a similarity value between the records in each pair. Attributes that: (1) represent record identifiers, (2) do not include nulls, (3) include cleaned values, (4) include unified (homogenized) values, and (5) do not change values over time are good candidates. Unfortunately, in real scenarios, attributes exhibiting all these characteristics are often unavailable. In our project, the set of attributes selected for comparing record pairs is based on the aforementioned preferable attribute characteristics and on expert knowledge. The set includes 18 attributes describing individual customers (including personal data and address).
SMT4: Similarities between corresponding attributes (selected in SMT3) in pairs of records are computed by means of similarity measures. The literature lists well over 30 different similarity measures for text data (e.g., [14, 15]). Unfortunately, there are no rules that would guide a data scientist in selecting the right similarity measure for an attribute with a given characteristic of its values. For this reason, in our project, the selection of the most suitable similarity measures for our problem was based on extensive experimental evaluation [16, 17].
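To illustrate the kind of measures evaluated in SMT4, the following minimal sketch computes two classic text similarity measures on a pair of attribute values: a normalized edit (Levenshtein) similarity and the Ratcliff/Obershelp ratio from Python's standard library. The sample names are hypothetical, and these two measures are examples only, not necessarily the ones selected in the project (see [16, 17]).

from difflib import SequenceMatcher

def edit_similarity(a: str, b: str) -> float:
    """Normalized Levenshtein similarity in [0, 1]."""
    if not a and not b:
        return 1.0
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return 1.0 - prev[-1] / max(len(a), len(b))

def gestalt_similarity(a: str, b: str) -> float:
    """Ratcliff/Obershelp similarity from the standard library."""
    return SequenceMatcher(None, a, b).ratio()

# Hypothetical attribute values from two customer records:
print(edit_similarity("Kowalewska", "Kowalewski"))     # 0.9
print(gestalt_similarity("Kowalewska", "Kowalewski"))  # 0.9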
SMT5: The measures selected in SMT4 are applied to computing similarities of corresponding attribute values in pairs of customer records. An adequate similarity measure is applied to every pair of attributes. Based on the attribute similarities, an overall record similarity value is computed. In a simple scenario, all compared attributes are equally important. In practice, some attributes may contribute more to record similarity than others. For example, a last name is more important (usually cleaner and not null) than an email address (frequently null and changing in time). For this reason, the compared attributes must be weighted, resulting in an overall weighted similarity of records. The first challenge in task SMT5 is to define adequate weights for the attributes. In practice, these weights are defined by a trial-and-error procedure and by applying expert knowledge. Based on the similarity of a pair of records, the pair is classified either as duplicates (class T), probably duplicates (class P), or non-duplicates (class N). For this kind of classification, so-called similarity thresholds have to be defined for each class. These thresholds impact the number of true positives and false negatives. Setting adequate values of the thresholds is another challenge in SMT5. In practice, the thresholds are defined experimentally with the support of domain experts [14]. In our project, to find attribute weights and similarity thresholds, we applied mathematical programming [17].
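A minimal sketch of the SMT5 classification step follows; the attribute names, weights, and thresholds here are illustrative placeholders, whereas in the project they were obtained with mathematical programming [17].

WEIGHTS = {"last_name": 0.5, "first_name": 0.3, "email": 0.2}  # sum to 1
T_THRESHOLD = 0.90  # record similarity >= 0.90 -> class T
P_THRESHOLD = 0.75  # similarity in [0.75, 0.90) -> class P, below -> N

def record_similarity(attr_sims: dict) -> float:
    """Weighted sum of the per-attribute similarities computed in SMT4."""
    return sum(WEIGHTS[a] * s for a, s in attr_sims.items())

def classify(attr_sims: dict) -> str:
    sim = record_similarity(attr_sims)
    if sim >= T_THRESHOLD:
        return "T"  # duplicates
    if sim >= P_THRESHOLD:
        return "P"  # probably duplicates
    return "N"      # non-duplicates

print(classify({"last_name": 1.0, "first_name": 0.95, "email": 0.4}))  # P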
SMT6: To compare records in a group, a popular technique called the sorted neighborhood method is used. It accepts one parameter: the size of a sliding window within which records are compared. This parameter impacts the computational performance and the number of discovered (potential) duplicates. Defining the size of the sliding window is challenging. A window that is too narrow may prevent discovering all potential duplicates, whereas a window that is too wide will allow more potential duplicates to be discovered at the cost of unnecessary time overhead caused by comparing more records, some of which will be false positives. In our project, the window size was selected by means of extensive experiments [13].
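A minimal sketch of the sorted-neighborhood technique follows; the sorting key and window size are illustrative, as the actual key construction and window tuning are described in [13].

def sorted_neighborhood(records, key, w):
    """Yield candidate pairs from a sliding window of size w over the
    records sorted on the given key."""
    ordered = sorted(records, key=key)
    for i in range(len(ordered)):
        # pair the i-th record with the next w - 1 records in sort order
        for j in range(i + 1, min(i + w, len(ordered))):
            yield ordered[i], ordered[j]

# Hypothetical customer records and sorting key:
records = [{"last_name": "Kowalewska"}, {"last_name": "Marszalek"},
           {"last_name": "Kowalewski"}]
for r1, r2 in sorted_neighborhood(records, key=lambda r: r["last_name"], w=2):
    print(r1["last_name"], "<->", r2["last_name"])
# Kowalewska <-> Kowalewski
# Kowalewski <-> Marszalek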
SMT7: Since similar records may form groups larger than pairs, in order to find such groups, all similar pairs of records have to be combined. To this end, we created a record similarity graph. In this graph, nodes represent records and edges represent similarity links between records (labeled with similarity values).
SMT8: Similar records in the record similarity graph can be extracted by cutting the graph into sub-graphs. In our project, we applied clustering of records using the similarity values of pairs of records computed in SMT5. For this purpose, we modified the Highly Connected Subgraphs (HCS) clustering algorithm [18]. HCS recursively applies the minimum cut to the record similarity graph (separately for each connected component) until the obtained sub-graphs are highly connected, i.e., the size of the minimum cut is larger than the number of nodes divided by two. The resulting sub-graphs form the groups of similar records. Our modifications include: using a minimum weighted cut instead of a minimum cut, searching for additional edges in sufficiently small subgraphs, and assigning the resulting singletons to the most similar groups after the clustering ends. The performance of the modified HCS is close to linear w.r.t. the number of pairs, since larger connected components in the similarity graph are rare.
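A minimal sketch of the modified HCS idea follows, using networkx and its Stoer-Wagner global minimum weighted cut; of the modifications listed above, only the weighted cut is shown, while the edge search in small subgraphs and the singleton reassignment are omitted.

import networkx as nx

def hcs(g: nx.Graph) -> list:
    """Recursively split g with a minimum weighted cut until the cut
    weight exceeds half the number of nodes (highly connected)."""
    if g.number_of_nodes() < 2:
        return [set(g.nodes)]
    cut_value, (part1, part2) = nx.stoer_wagner(g)
    if cut_value > g.number_of_nodes() / 2:
        return [set(g.nodes)]
    return hcs(g.subgraph(part1).copy()) + hcs(g.subgraph(part2).copy())

def cluster(similarity_graph: nx.Graph) -> list:
    """Apply HCS separately to each connected component."""
    clusters = []
    for comp in nx.connected_components(similarity_graph):
        clusters.extend(hcs(similarity_graph.subgraph(comp).copy()))
    return clusters

# A toy record similarity graph; edge weights are record similarities:
g = nx.Graph()
g.add_weighted_edges_from([(1, 2, 0.9), (2, 3, 0.8), (1, 3, 0.95), (3, 4, 0.3)])
print(cluster(g))  # [{1, 2, 3}, {4}]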
Notice that the pipeline we propose maps to the BLDDP as follows: SMT1 realizes block building in the BLDDP; SMT2 realizes block processing; SMT3, SMT4, and SMT5 realize entity matching; SMT6, SMT7, and SMT8 realize entity clustering. A more detailed description of our pipeline is available in [17].
3. Tasks in the ML pipeline

Figure 2: The tasks in the deduplication pipeline based on machine learning: MLT1: record labeling; MLT2: building labeling rules; MLT3: record labeling by rules; MLT4: building learning set and test set; MLT5: feature engineering; MLT6: model building (classifier); MLT7: model testing and tuning

The machine learning pipeline applied in the project is shown in Figure 2.

MLT1: In this task, 1000 pairs of records were manually assigned labels by domain experts. A label indicated one of the three classes, namely T, P, or N. All the labels were roughly equally represented. Since the Bank database included over 20 million customer records, it was not possible to manually label a sufficiently large sample of record pairs. For this reason, in MLT2, a set of rules was built manually with the support of domain experts. It is important to stress that the high quality of these rules is crucial for the subsequent tasks. The rules were next applied in task MLT3 to automatically label a set of 50 000 pairs of customer records.

It was assumed that the effect of applying the rules was to determine decisions consistent with those of the domain experts. Even for a small set of attributes being compared in pairs (18 attributes in our case), attempts to create simple rules of the type if X then Y else Z turned out to be challenging. This was due to the very different scenarios present in the data, in particular, the lack of 100% similarity between the compared attribute values. It turned out that rules that only confirmed whether a given pair could be a representative of a given class T, P, or N were much easier to interpret. This led to a set of m rules under the following assumptions:

    • each rule can return only one answer: T, P, or N;
    • a rule for a given pair of customer records may return no answer;
    • for a given pair, successive rules are applied; an answer can be returned by a set of 0 to 89 rules;
    • rules have a certain priority: the strongest are rules that assign class N, followed by rules that assign class T, and the weakest rules are those assigning P.

Under the above assumptions, the final decision made by the rules is determined as described in Tab. 1. It shows the number of N, P, and T responses generated by a set of rules. Symbol ∗ denotes any number of rules returning a given answer. In general, the following four cases are possible. Case 1: when at least one rule returns N, then the final decision is N. This is due to the highest priority of rules returning N. Case 2: decision P is taken when no rule returns N, no rule returns T, and at least one rule returns P. Case 3: decision T is taken when no rule returns N and at least one rule returns T. Case 4: no rule returns an answer. In such a case, given the Bank's conservative policy, the final decision is N, since a possible false negative is much less costly than a false positive (as sketched in the code below).

Table 1: Determination of the final rule decision
    No.   N    P    T    Decision
    1     >0   ∗    ∗    N
    2     0    >0   0    P
    3     0    ∗    >0   T
    4     0    0    0    ? (N by default)
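A minimal sketch of this decision logic, assuming each rule either returns one of the class labels or abstains with None:

def final_decision(answers):
    """Combine per-rule answers ("T", "P", "N", or None) as in Tab. 1."""
    votes = {a for a in answers if a is not None}
    if "N" in votes:
        return "N"  # case 1: any N wins (highest priority)
    if "T" in votes:
        return "T"  # case 3: no N, at least one T
    if "P" in votes:
        return "P"  # case 2: only P answers
    return "N"      # case 4: no answer -> conservative default N

print(final_decision(["T", None, "P"]))  # T
print(final_decision([None, None]))      # N (by default)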
Despite tremendous effort, designing rules that exactly matched the experts' decisions proved to be impossible in a reasonable amount of time. The confusion matrix for the created set of 89 rules is shown in Tab. 2. In addition to the T, P, and N classes, it is easy to see that for a total of 70 pairs, the rules made no decision (column None). This means that for 1 000 pairs, the coverage of pairs by rules equals 93%. It is noteworthy that the assessments made by the rules were more conservative than those made by the experts, i.e., in the case of confusion, they assigned class N to pairs considered to be P and class P to pairs considered to be T. Notably, a pair of class N was never considered either P or T.

Table 2: Confusion matrix for the final set of rules
                       Rules decision
                       N     P     T     None
    Expert      N      310   0     0     38
    decision    P      8     373   0     30
                T      0     7     232   2
In MLT3, from the test set of over 5 million customer records, 2.5 million pairs were created and then labeled by means of the rules. The set of pairs to be labeled was not random, and a good portion of it contained common as well as problematic examples. Therefore, it can be considered somewhat biased and unrepresentative for the coverage assessment measure. It turned out to be much more interesting to assess coverage by experimenting with the set of pairs obtained by applying step SMT2 (from the SM approach). For the set of 2.5 million pairs analyzed, in only 23 352 cases were the rules not able to make a final decision. This means a rule coverage of 99.987%. It should be noted here that, of the 89 rules, the most popular rules (i.e., those that produced a result for the largest number of pairs) assigned class N. The most popular rule produced N for more than 83% of pairs. When considering the entire population of pairs (i.e., any pairs from the set), the result would be even higher, as the chance of finding duplicates decreases significantly.
In MLT4, from the set of pairs built in MLT3, training and testing data sets were created by stratified sampling. Having created these two subsets, we added new features inspired by the conditions of the expert rules (task MLT5).
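A minimal sketch of the stratified split in MLT4 follows, with random placeholder features and labels standing in for the engineered pair features and the rule-generated labels:

import numpy as np
from sklearn.model_selection import train_test_split

X = np.random.rand(1000, 18)                      # placeholder pair features
y = np.random.choice(["T", "P", "N"], size=1000)  # placeholder rule labels

# stratify=y preserves the T/P/N class proportions in both subsets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42)
print(len(X_train), len(X_test))  # 700 300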
In MLT6, we ran several preparatory experiments to get the "feel" of the data. In one of them, we used the PCA method to obtain two principal components and rendered each example in the dataset of 1000 pairs manually labeled by domain experts on a two-dimensional scatter plot, with classes T, P, and N marked in colors. The result was that classes T and N were separated quite well, while P was scattered all over the plot. This observation meant that class P would be problematic.
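A minimal sketch of this preparatory experiment, again with placeholder data standing in for the expert-labeled pairs:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

X = np.random.rand(1000, 18)                      # placeholder pair features
y = np.random.choice(["T", "P", "N"], size=1000)  # placeholder expert labels

points = PCA(n_components=2).fit_transform(X)     # two principal components
for cls, color in (("T", "green"), ("P", "orange"), ("N", "red")):
    mask = y == cls
    plt.scatter(points[mask, 0], points[mask, 1], c=color, s=8, label=cls)
plt.legend(title="class")
plt.show()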
Having run the preparatory experiments, four different classification models were created based on the training set. The models included: a decision tree, a random forest, an SVM, and a feed-forward neural network. These models were built using the Python package sklearn. The hyperparameters of the models were adjusted manually or (if needed) with the help of the optuna package, based on the dataset labeled by the domain experts. We also attempted to use an auto-ML approach by means of the tpot library, but it was not able to produce any model that would be more accurate than the models mentioned above within the given time frame.
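A minimal sketch of such a tuning setup, pairing an sklearn random forest with an optuna study; the search space, scoring, and data are illustrative, not the project's actual configuration:

import numpy as np
import optuna
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X = np.random.rand(1000, 18)                      # placeholder features
y = np.random.choice(["T", "P", "N"], size=1000)  # placeholder labels

def objective(trial):
    # sample hyperparameters and score them with cross-validation
    model = RandomForestClassifier(
        n_estimators=trial.suggest_int("n_estimators", 50, 500),
        max_depth=trial.suggest_int("max_depth", 3, 20),
        random_state=42)
    return cross_val_score(model, X, y, cv=3, scoring="f1_macro").mean()

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=20)
print(study.best_params)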
Finally, in MLT7, the quality of all the produced models was tested based on the testing set obtained from the automatically labeled data, as well as on the set of 1000 pairs manually labeled by domain experts. The model quality was measured by means of precision, recall, the F1 measure, and accuracy.
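A minimal sketch of the evaluation in MLT7 with sklearn; the macro averaging over the three classes is an assumption here, as the averaging scheme is not stated above, and the labels are illustrative:

from sklearn.metrics import (accuracy_score, confusion_matrix,
                             precision_recall_fscore_support)

y_true = ["T", "P", "N", "N", "P", "T", "N"]  # e.g., reference labels
y_pred = ["T", "P", "N", "P", "P", "N", "N"]  # e.g., model decisions

prec, rec, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="macro", zero_division=0)
print(f"precision={prec:.4f} recall={rec:.4f} "
      f"F1={f1:.4f} accuracy={accuracy_score(y_true, y_pred):.4f}")
print(confusion_matrix(y_true, y_pred, labels=["N", "P", "T"]))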
4. Development

Pipeline implementation environment: Both the SM and ML pipelines were implemented in a typical data science environment, which included: (1) a relational database management system to store customer data, (2) csv files to store temporary data produced by the tasks in the pipelines, (3) spreadsheet files to store groups of duplicated records, and (4) Python programs and standard packages to implement the tasks in both pipelines.

Data size: In the reported project, in the development and testing phases, we used 2.5 million pairs of customer records. In the production system, the pipelines run on the Bank customer database, which includes over 20 million records.


5. Results

Tab. 3 presents the precision, recall, F1 measure, and accuracy achieved by the ML models on the dataset of 1000 pairs manually labeled by the domain experts. For comparison, we also add the results achieved by the statistical modeling approach. The numbers in brackets represent the ranking of the values (best: 1, worst: 5). As can be observed, the best performance on this dataset was obtained by the Statistical Model. The worst result was obtained by the decision tree. In the rest of the cases, the ranking is not so obvious, as not all measures point to the same rank. However, since F1 is an aggregation of precision and recall, and the ranking of F1 is consistent with the accuracy, we can assume that the second best is the random forest, followed by the SVM and the feed-forward neural network (FFNN).

Table 3: Comparison of models on the expert labeled dataset
                 Decision Tree    Random Forest    FFNN         SVM          Statistical model
    Precision    [5] 0.7620       [3] 0.8369       [4] 0.8176   [2] 0.8422   [1] 0.8607
    Recall       [5] 0.7801       [2] 0.8498       [3] 0.8391   [4] 0.8208   [1] 0.8799
    F1           [5] 0.7631       [2] 0.8407       [4] 0.8241   [3] 0.8296   [1] 0.8673
    Accuracy     [5] 0.7612       [2] 0.8383       [4] 0.8201   [3] 0.8276   [1] 0.8640


The best results obtained by the Statistical Model most probably stem from the fact that all parameters of the model (mainly attribute weights and similarity thresholds) were optimized directly on the expert-labeled dataset that was also used for testing. The same dataset was only used for optimizing the hyperparameters of the other models (random forest, SVM, FFNN). The decision trees were adjusted manually.
Tab. 4 presents the precision, recall, F1, and accuracy achieved by the ML models on the testing dataset obtained by means of the 89 rules. It is crucial to underline that the results presented in the table compare how close the results of the ML models are to the results obtained by applying the rules, and not to the ground truth (which is not available). The structure of the table is the same as that of Tab. 3.

Table 4: Comparison of models on the testing dataset
                 Decision Tree    Random Forest    FFNN         SVM          Statistical model
    Precision    [2] 0.9577       [1] 0.9629       [3] 0.9444   [4] 0.8898   [5] 0.8051
    Recall       [3] 0.9399       [2] 0.9551       [4] 0.9368   [1] 0.9619   [5] 0.8182
    F1           [2] 0.9484       [1] 0.9590       [3] 0.9405   [4] 0.9201   [5] 0.8065
    Accuracy     [2] 0.9958       [1] 0.9967       [3] 0.9953   [4] 0.9926   [5] 0.9859
There are a few interesting observations to be made here. First, notice that the orderings by precision, F1, and accuracy are consistent, while the ordering by recall is different. However, similarly as before, we can ignore precision and recall in favor of their aggregation, i.e., F1, to obtain the ranking of the models. In this approach, the best model is the random forest, followed by the decision tree, FFNN, SVM, and the Statistical Model. Thus, the obtained order is completely different. We explain the results as follows.

Since the training and testing datasets had labels generated by the same set of rules, the classifiers that were trained on the training dataset tried to generalize this set of rules. Thus, their performance on the testing dataset is quite good. The Statistical Model was not built based on the set of rules at all, and thus, its performance on the testing dataset is worse. It is also interesting that the tree-based classifiers, i.e., the decision tree and the random forest, perform better than mathematical models such as the FFNN and SVM, which come down to a division of the attribute space by hyperplanes. We suspect that this stems from the fact that trees are just representations of sets of rules. Thus, the tree model structure fits the underlying source of the labels better than the mathematical models do. One can also observe that a single test in a node of a tree is also a hyperplane dividing the attribute space, but such hyperplanes are orthogonal to one of the axes, while in the mathematical models, the hyperplanes are aligned arbitrarily. The larger degree of freedom makes it harder for the mathematical models to find the original model that produced the labels, which leads to worse results.

Regardless of the testing dataset, the best results were achieved by the random forest model. Its confusion matrix for the dataset labeled by the domain experts is given in Tab. 5. One of the most important observations here is the fact that the number of erroneous classifications involving class P is an order of magnitude greater than the number of erroneous classifications involving classes T and N. This observation confirms the result mentioned in Section 3.

Table 5: Confusion matrix for random forest
                       Random forest decision
                       N     P     T
    Expert      N      318   24    6
    decision    P      24    319   68
                T      1     28    212


6. Summary

In this paper, we reported our experience from an R&D project on deduplicating customer data. The project was realized for the biggest Polish bank, PKO BP. While the presented solutions were designed explicitly for the Bank's dataset, the findings are general and the approach can be fitted to other, similar problems.

In the project, we designed two deduplication pipelines: (1) one based on statistical modeling and (2) one based on machine learning. The SM pipeline extends the standard deduplication pipeline from the literature to fit the particular characteristics of the data being deduplicated and the project requirements.
The ML pipeline was designed from scratch within the project. Both pipelines were implemented, tested on a data set of 2.5 million pairs of customer records, and verified by domain experts. The obtained results were accepted by the Bank. The project ended with the decision to run the SM pipeline in the production system on the Bank database, which includes over 20 million customer records. This pipeline has already been deployed in the Bank.

It must be stressed that the real settings of the project differ from the ones assumed in the research literature (the ideal ones), among others w.r.t.: (1) the quality of the data being deduplicated, (2) the sizes of the deduplicated data, (3) the availability of tagged data for ML algorithms, (4) the available development environments, and (5) the performance of a deduplication pipeline (time and quality). The real settings, which are far from the ideal ones, made the project very challenging.

Acknowledgements. The project is supported by grant no. POIR.01.01.01-00-0287/19 from the National Center for Research and Development.
References

[1] A. K. Elmagarmid, P. G. Ipeirotis, V. S. Verykios, Duplicate record detection: A survey, IEEE Transactions on Knowledge and Data Engineering 19 (2007) 1–16.
[2] H. Köpcke, E. Rahm, Frameworks for entity matching: A comparison, Data & Knowledge Engineering 69 (2010) 197–210.
[3] G. Papadakis, L. Tsekouras, E. Thanos, G. Giannakopoulos, T. Palpanas, M. Koubarakis, Domain- and structure-agnostic end-to-end entity resolution with JedAI, SIGMOD Record 48 (2019) 30–36.
[4] X. Chen, Y. Xu, D. Broneske, G. C. Durand, R. Zoun, G. Saake, Heterogeneous committee-based active learning for entity resolution (HeALER), in: European Conf. on Advances in Databases and Information Systems (ADBIS), volume 11695 of LNCS, Springer, 2019, pp. 69–85.
[5] S. Sarawagi, A. Bhamidipaty, Interactive deduplication using active learning, in: ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining, ACM, 2002, pp. 269–278.
[6] J. A. Silva, D. A. Pereira, A multiclass classification approach for incremental entity resolution on short textual data, Int. Journal of Business Intelligence and Data Mining 18 (2021) 218–245.
[7] U. Brunner, K. Stockinger, Entity matching with transformer architectures - a step forward in data integration, in: Int. Conf. on Extending Database Technology (EDBT), OpenProceedings.org, 2020, pp. 463–473.
[8] A. Jain, S. Sarawagi, P. Sen, Deep indexed active learning for matching heterogeneous entity representations, Proc. VLDB Endowment 15 (2021) 31–45.
[9] S. Mudgal, H. Li, T. Rekatsinas, A. Doan, Y. Park, G. Krishnan, R. Deep, E. Arcaute, V. Raghavendra, Deep learning for entity matching: A design space exploration, in: SIGMOD Int. Conf. on Management of Data, ACM, 2018, pp. 19–34.
[10] S. Thirumuruganathan, H. Li, N. Tang, M. Ouzzani, Y. Govind, D. Paulsen, G. Fung, A. Doan, Deep learning for blocking in entity matching: A design space exploration, Proc. VLDB Endowment 14 (2021) 2459–2472.
[11] A. Zeakis, G. Papadakis, D. Skoutas, M. Koubarakis, Pre-trained embeddings for entity resolution: An experimental analysis, Proc. VLDB Endowment 16 (2023) 2225–2238.
[12] P. Boiński, M. Sienkiewicz, B. Bębel, R. Wrembel, D. Gałęzowski, W. Graniszewski, On customer data deduplication: Lessons learned from a R&D project in the financial sector, in: Workshops of the EDBT/ICDT 2022 Joint Conference, volume 3135 of CEUR Workshop Proceedings, CEUR-WS.org, 2022.
[13] P. Boiński, W. Andrzejewski, B. Bębel, R. Wrembel, On tuning the sorted neighborhood method for record comparisons in a data deduplication pipeline - industrial experience report, in: Int. Conf. on Database and Expert Systems Applications (DEXA), volume 14146 of LNCS, Springer, 2023, pp. 164–178.
[14] P. Christen, Data Matching - Concepts and Techniques for Record Linkage, Entity Resolution, and Duplicate Detection, Data-Centric Systems and Applications, Springer, 2012.
[15] F. Naumann, Similarity measures, Hasso Plattner Institut, 2013.
[16] W. Andrzejewski, B. Bębel, P. Boiński, M. Sienkiewicz, R. Wrembel, Text similarity measures in a data deduplication pipeline for customers records, in: Int. Workshop on Design, Optimization, Languages and Analytical Processing of Big Data (DOLAP), volume 3369 of CEUR Workshop Proceedings, CEUR-WS.org, 2023, pp. 33–42.
[17] W. Andrzejewski, B. Bębel, P. Boiński, R. Wrembel, On tuning parameters guiding similarity computations in a data deduplication pipeline for customers records: Experience from a R&D project, Information Systems 121 (2024) 102323.
[18] F. Hüffner, C. Komusiewicz, A. Liebtrau, R. Niedermeier, Partitioning biological networks into highly connected clusters with maximum edge coverage, IEEE/ACM Transactions on Computational Biology and Bioinformatics 11 (2014) 455–467.