=Paper=
{{Paper
|id=Vol-3651/DARLI-AP_paper2
|storemode=property
|title=Statistical Modeling vs. Machine Learning for Deduplication of Customer Records
|pdfUrl=https://ceur-ws.org/Vol-3651/DARLI-AP-2.pdf
|volume=Vol-3651
|authors=Witold Andrzejewski,Bartosz Bebel,Pawel Boinski,Justyna Kowalewska,Agnieszka Marszalek,Robert Wrembel
|dblpUrl=https://dblp.org/rec/conf/edbt/AndrzejewskiBBK24
}}
==Statistical Modeling vs. Machine Learning for Deduplication of Customer Records==
Statistical Modeling vs. Machine Learning for Deduplication of Customer Records (industrial paper)

Witold Andrzejewski¹, Bartosz Bębel¹, Paweł Boiński¹, Justyna Kowalewska³, Agnieszka Marszałek³ and Robert Wrembel¹,²

¹ Poznan University of Technology, Poznań, Poland
² Interdisciplinary Centre for Artificial Intelligence and Cybersecurity, Poznań, Poland
³ PKO BP, Warsaw, Poland
Abstract
Large companies typically face the problem of multiple database records describing the same physical object (a.k.a. duplicates). There are multiple sources of duplicates, e.g., using multiple but unsynchronized data repositories, applications that do not check for duplicates before inserting data into a repository, and data errors. This problem is of particular importance when dealing with personal data, e.g., in healthcare, banking, and insurance. To handle the problem of duplicates, a state-of-the-art data deduplication pipeline has been developed. The pipeline is equipped with multiple complex algorithms. A promising direction in data deduplication is machine learning. In this paper, we report our experience in researching and developing two approaches to deduplication of customer records in a financial institution. These approaches are based on: (1) statistical modeling and (2) machine learning. In particular, this paper summarizes our findings from comparing these two approaches. The reported research was done within an R&D project for the biggest Polish bank, PKO BP.

Keywords
data quality, data deduplication, statistical modeling, machine learning, classification
==1. Introduction==

Large organizations and companies face the problem of duplicated data. This problem is frequent for companies that store customer data. Duplicated and outdated data cause economic losses, increase customer dissatisfaction, and deteriorate the reputation of a company. For these reasons, data integration, cleaning, and deduplication of customer records are among the core processes in data governance.

Data deduplication has been extensively researched in multiple research centers worldwide. The research resulted in a baseline data deduplication pipeline, e.g., [1, 2, 3]. It has become a standard pipeline for multiple data deduplication projects in various application domains. The pipeline includes four basic tasks, namely:

• blocking (a.k.a. indexing), which arranges records into groups, such that each group is likely to include duplicates,
• block processing (a.k.a. filtering), which eliminates records that do not have to be compared,
• entity matching (a.k.a. similarity computation), which computes similarity values between record pairs, and
• entity clustering, which creates larger clusters of similar records.

Each of these tasks is supported by multiple algorithms. Some of them apply very complex statistical models (SM), whereas others are based on standard machine learning (ML), including deep learning (DL). The ML techniques typically apply various classification models to divide records into classes of matches, probable matches, and non-matches, e.g., [4, 5, 6]. The DL techniques apply language models for both record blocking and record matching, e.g., [7, 8, 9, 10, 11].

Since there are two families of approaches to data deduplication, a natural question is which family offers more accurate deduplication models. In this paper, we outline our experience and findings on designing a deduplication pipeline for customer data. We designed two versions of the pipeline: the first one uses statistical modeling for discovering duplicates, the second one uses ML techniques. Both pipelines were developed within an R&D project for the biggest Polish bank, PKO BP (https://pkobp.pl). The pipelines were tested on a real data set of over 5 million customer records. To the best of our knowledge, this is the biggest deduplication experiment comparing SM and ML approaches reported in the research literature.

Published in the Proceedings of the Workshops of the EDBT/ICDT 2024 Joint Conference (March 25-28, 2024), Paestum, Italy. Contact: robert.wrembel@cs.put.poznan.pl (R. Wrembel). ORCID: 0000-0001-9486-929X (W. Andrzejewski); 0000-0002-6426-3809 (B. Bębel); 0000-0003-4914-9394 (P. Boiński); 0000-0001-6037-5718 (R. Wrembel). © 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073.

==2. Tasks in the SM pipeline==

Figure 1: The tasks in the deduplication pipeline based on statistical modeling. SMT1: choosing grouping attributes [block building]; SMT2: choosing a method for comparing records [block processing]; SMT3: choosing attributes for comparing record pairs [entity matching]; SMT4: choosing similarity measures for attributes [entity matching]; SMT5: finding attribute weights and similarity thresholds [entity matching]; SMT6: building pairs of similar records [entity clustering]; SMT7: building record similarity graph [entity clustering]; SMT8: building record similarity subgraphs [entity clustering].
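The four baseline BLDDP tasks that the SM pipeline refines can be pictured as a composable flow. The sketch below is a toy illustration under assumed record shapes (dicts with a hypothetical "zip" grouping key) and a made-up character-overlap similarity; it is not the project's implementation:

```python
from collections import defaultdict

def blocking(records, key):
    """Arrange records into groups (blocks) by a grouping key."""
    blocks = defaultdict(list)
    for r in records:
        blocks[r[key]].append(r)
    return list(blocks.values())

def block_processing(blocks):
    """Eliminate blocks that cannot contain duplicates (here: singletons)."""
    return [b for b in blocks if len(b) > 1]

def entity_matching(blocks, similarity):
    """Compute a similarity value for every record pair within each block."""
    pairs = []
    for b in blocks:
        for i in range(len(b)):
            for j in range(i + 1, len(b)):
                pairs.append((b[i], b[j], similarity(b[i], b[j])))
    return pairs

def entity_clustering(pairs, threshold):
    """Keep sufficiently similar pairs; a real implementation would then
    merge them into larger clusters (e.g., connected components)."""
    return [(a, b) for a, b, s in pairs if s >= threshold]

records = [
    {"id": 1, "zip": "61-138", "name": "Jan Kowalski"},
    {"id": 2, "zip": "61-138", "name": "Jan Kowalsky"},
    {"id": 3, "zip": "00-950", "name": "Anna Nowak"},
]
# Toy character-set Jaccard similarity, standing in for real measures.
name_sim = lambda a, b: len(set(a["name"]) & set(b["name"])) / len(set(a["name"]) | set(b["name"]))
candidates = entity_clustering(
    entity_matching(block_processing(blocking(records, "zip")), name_sim), 0.5)
```

Here only the two records sharing the "zip" block survive block processing and end up as a candidate duplicate pair.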
We built the SM pipeline as an extension of the baseline data deduplication pipeline (BLDDP), to serve the particular goals of the project. First, our pipeline explicitly includes all steps that we found to be crucial for the deduplication process (whereas in the BLDDP, some steps are implicit). Second, the last two tasks in our pipeline extend the BLDDP and allow to further merge groups of similar records into subgraphs. Third, our pipeline accepts partially dirty data, as in practice it is impossible to perfectly clean all data. The SM pipeline applied in the project is shown in Figure 1.

SMT1: In this task, grouping attributes are selected, which allow to co-locate similar records in the same group. In our project, initially 20 candidate grouping attributes were selected by domain experts. The attributes allowed either to identify customers, or they were error-free and not null, or their values had not changed over time. The candidate attributes were further verified by means of a method that we developed. For each attribute, the method computed a value representing its fit as a grouping attribute [12]. Finally, the obtained ranking of attributes was verified by domain experts again. Based on their input, the final set of grouping attributes was selected.

SMT2: Using grouping attributes obtained from SMT1, records are arranged into groups via sorting. The method we used is very similar to the one described in [13]; however, multiple sortings were used.

SMT3: The goal of this task is to select attributes that will contribute to assessing a similarity value between records in each pair. Attributes that: (1) represent record identifiers, (2) do not include nulls, (3) include cleaned values, (4) include unified (homogenized) values, and (5) do not change values over time are good candidates. Unfortunately, in real scenarios, attributes exposing all these characteristics are often unavailable. In our project, the set of attributes selected for comparing record pairs is based on the aforementioned preferable attribute characteristics and on expert knowledge. The set includes 18 attributes describing individual customers (including personal data and address).

SMT4: Similarities between corresponding attributes (selected in SMT3) in pairs of records are compared by means of similarity measures. The literature lists well over 30 different similarity measures for text data (e.g., [14, 15]). Unfortunately, there are no rules that would guide a data scientist in selecting the right similarity measure for an attribute having a given characteristic of its values. For this reason, in our project, the selection of the most suitable similarity measures for our problem was based on extensive experimental evaluation [16, 17].

SMT5: Measures selected in SMT4 are applied to computing similarities of corresponding attribute values constituting customer pairs of records. An adequate similarity measure is applied to every pair of attributes. Based on attribute similarities, an overall record similarity value is computed. In a simple scenario, all compared attributes are equally important. In practice, some attributes may contribute more to record similarity than others. For example, a last name is more important (usually cleaner and not null) than an email address (frequently null and changing in time). For this reason, the compared attributes must be weighted, resulting in an overall weighted similarity of records. The first challenge in task SMT5 is to define adequate weights for attributes. In practice, these weights are defined by a trial-and-error procedure and by applying expert knowledge. Based on the similarity of a pair of records, the pair is classified either as duplicates (class T), probable duplicates (class P), or non-duplicates (class N). For this kind of classification, so-called similarity thresholds have to be defined for each class. These thresholds impact the number of true positives and false negatives. Setting adequate values of the thresholds is another challenge in SMT5. In practice, the thresholds are defined experimentally with the support of domain experts [14]. In our project, to find attribute weights and similarity thresholds, we apply mathematical programming [17].

SMT6: To compare records in a group, a popular technique called sorted neighborhood is used. It accepts one parameter, the size of a sliding window in which records are compared. This parameter impacts the computational performance and the number of discovered (potential) duplicates. Defining the size of the sliding window is challenging. A window that is too narrow may prevent discovering all potential duplicates, whereas a window that is too wide will allow to discover more potential duplicates at the cost of unnecessary time overhead caused by comparing more records - some of them will be false positives. In our project, the window size was selected by means of extensive experiments [13].
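The core of SMT5 and SMT6 can be sketched as follows. This is an illustrative toy: the attribute weights, the T/P/N thresholds, and the character-set similarity are made-up stand-ins for the measures actually selected in SMT4 and the values optimized via mathematical programming [17]:

```python
def attr_sim(a, b):
    """Toy character-set similarity; null values contribute zero."""
    if not a or not b:
        return 0.0
    sa, sb = set(a.lower()), set(b.lower())
    return len(sa & sb) / len(sa | sb)

# Hypothetical weights: last name matters more than e-mail (see SMT5).
WEIGHTS = {"last_name": 0.5, "first_name": 0.3, "email": 0.2}
T_THRESHOLD, P_THRESHOLD = 0.85, 0.6  # illustrative similarity thresholds

def record_sim(r1, r2):
    """Weighted overall similarity of two records (SMT5)."""
    return sum(w * attr_sim(r1[a], r2[a]) for a, w in WEIGHTS.items())

def classify(sim):
    """Map a similarity value to class T, P, or N via thresholds."""
    if sim >= T_THRESHOLD:
        return "T"
    if sim >= P_THRESHOLD:
        return "P"
    return "N"

def sorted_neighborhood(records, key, window):
    """Compare each record with its window-1 successors in sort order (SMT6)."""
    records = sorted(records, key=key)
    for i, r1 in enumerate(records):
        for r2 in records[i + 1:i + window]:
            yield r1, r2, classify(record_sim(r1, r2))
```

A narrow window misses duplicates that sort far apart, which is exactly the trade-off tuned experimentally in [13].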
SMT7: Since similar records may form groups larger than pairs, in order to find such groups, all similar pairs of records have to be combined. To this end, we created a record similarity graph. In this graph, nodes represent records and edges represent similarity links between records (labeled with similarity values).

SMT8: Similar records in the record similarity graph can be extracted by cutting the graph into sub-graphs. In our project, we applied clustering of records by using similarity values of pairs of records computed in SMT5. For this purpose, we modified the Highly Connected Subgraphs (HCS) clustering algorithm [18]. HCS recursively applies the minimum cut to the record similarity graph (separately for each connected component) until the obtained sub-graphs are highly connected, i.e., the size of the minimum cut is larger than the number of nodes divided by two. The resulting sub-graphs form the groups of similar records. The modifications include: using a minimum weighted cut instead of a minimum cut, searching for additional edges in sufficiently small subgraphs, and assigning resulting singletons to the most similar groups after the clustering ends. The performance of the modified HCS is close to linear w.r.t. the number of pairs, since larger connected components in the similarity graph are rare.

Notice that the pipeline we proposed is mapped to the BLDDP as follows: SMT1 realizes block building in the BLDDP; SMT2 realizes block processing; SMT3, SMT4, and SMT5 realize entity matching; SMT6, SMT7, and SMT8 realize entity clustering. A more detailed description of our pipeline is available in [17].

==3. Tasks in the ML pipeline==

The machine learning pipeline applied in the project is shown in Figure 2.

Figure 2: The tasks in the deduplication pipeline based on machine learning. MLT1: record labeling; MLT2: building labeling rules; MLT3: record labeling by labeling rules; MLT4: building learning set and test set; MLT5: feature engineering; MLT6: model building (classifier); MLT7: model testing and tuning.

MLT1: In this task, 1000 pairs of records were manually assigned labels by domain experts. A label indicated one of the three classes, namely T, P, or N. All the labels were roughly equally represented. Since the Bank database included over 20 million customer records, it was not possible to manually label a sufficiently large sample of record pairs. For this reason, in MLT2 a set of rules was built manually with the support of domain experts. It is important to stress that the high quality of these rules is crucial for subsequent tasks. The rules were next applied in task MLT3 to automatically label a set of 50 000 pairs of customer records.

It was assumed that the effect of applying the rules was to determine decisions consistent with those of the domain experts. Even for a small set of attributes being compared in pairs (18 attributes in our case), attempts to create simple rules of type if X then Y else Z turned out to be challenging. It was due to the very different scenarios present in the data, in particular, the lack of 100% similarity between the compared attribute values. It turned out that rules that only confirmed whether a given pair could be a representative of a given class T, P, or N were much easier to interpret. This led to a set of m rules under the following assumptions:

• each rule can return only one answer: T, P, or N;
• a rule for a given pair of customer records may return no answer;
• for a given pair, successive rules are applied; an answer can be returned by a set of 0 to 89 rules;
• rules have a certain priority - the strongest are rules that assign class N, followed by rules that assign class T, and the weakest rules are those assigning P.

The above assumptions cause that the final decision made by the rules is determined as described in Tab. 1. It shows the number of N, P, and T responses generated by a set of rules. Symbol ∗ denotes any number of rules returning a given answer. In general, the following four cases are possible. Case 1: when at least one rule returns N, then the final decision is N. This is due to the highest priority of rules returning N. Case 2: decision P is taken when no rule returns N, no rule returns T, and at least one rule returns P. Case 3: decision T is taken when no rule returns N and at least one rule returns T. Case 4: no rule returns an answer. In such a case, given the bank's conservative policy, the final decision is N, since a possible false negative is much less costly than a false positive.

Table 1: Determination of the final rule decision
No.   N    P    T    Decision
1    >0    ∗    ∗    N
2     0   >0    0    P
3     0    ∗   >0    T
4     0    0    0    ? (N by default)
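The final-decision logic of Tab. 1 is compact enough to state directly in code. The sketch below covers only the combination step; evaluation of the individual rules themselves is omitted:

```python
def final_decision(answers):
    """Combine the answers of all fired rules (each one of 'T', 'P', 'N')
    into one decision, with priority N > T > P and N as the default."""
    if "N" in answers:   # Case 1: any N wins (highest priority)
        return "N"
    if "T" in answers:   # Case 3: T beats P
        return "T"
    if "P" in answers:   # Case 2: only P answers present
        return "P"
    return "N"           # Case 4: no answer; conservative default is N
```

For example, `final_decision(["P", "T"])` yields T, while an empty answer set falls back to N, mirroring the bank's conservative policy.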
Despite tremendous effort, designing rules that matched exactly the experts' decisions proved to be impossible in a reasonable amount of time. The confusion matrix for the set of 89 rules created is shown in Tab. 2. In addition to the T, P, and N classes, it is easy to see that for a total of 70 pairs, the rules made no decision (column None). This means that for 1000 pairs, the coverage of pairs by rules is equal to 93%. It is noteworthy that the assessments made by the rules were more conservative than those made by the experts, i.e., in the case of confusion, they assigned class N to pairs considered to be P and class P to pairs considered to be T. It is notable that a pair of class N was never considered either P or T.

Table 2: Confusion matrix for the final set of rules (expert decision in rows, rules decision in columns)
            N     P     T    None
Expert N   310    0     0     38
Expert P     8   373    0     30
Expert T     0    7    232     2

In MLT3, from the test set of over 5 million customer records, 2.5 million pairs were created and then labeled by means of the rules. The set of pairs to be labeled was not random, and a good portion of it contained common as well as problematic examples. Therefore, it can be considered somewhat biased and unrepresentative for the coverage assessment measure. It turned out to be much more interesting to assess coverage by experimenting with the set of pairs obtained by applying step SMT2 (from the SM approach). For the set of 2.5 million pairs analyzed, in only 23 352 cases the rules were not able to make a final decision. This means a rule coverage of 99.987%. It should be noted that of the 89 rules, the most popular rules (i.e., those that produced a result for the largest number of pairs) assigned class N. The most popular rule produced N for more than 83% of pairs. When considering the entire population of pairs (i.e., any pairs from the set), the result would be even higher, as the chance to find duplicates decreases significantly.

In MLT4, from the set of pairs built in MLT3, training and testing data sets were created by stratified sampling. Having created these two subsets, we added new features inspired by the conditions of the expert rules (task MLT5).

In MLT6, we ran several preparatory experiments to get a "feel" of the data. In one of them, we used the PCA method to obtain two principal components and rendered each example from the dataset of 1000 pairs manually labeled by domain experts on a two-dimensional scatter plot, with classes T, P, and N marked in colors. The result was that classes T and N were separated quite well, while P was scattered all over the plot. This observation meant that class P would be problematic.

Having run the preparatory experiments, four different classification models were created based on the training set. The models included: decision tree, random forest, SVM, and feed-forward neural network. These models were built using the Python package sklearn. Hyperparameters of the models were adjusted manually or (if needed) with the help of the optuna package, based on the dataset labeled by domain experts. We also attempted to use an auto-ML approach by means of the tpot library, but it was not able to produce any model more accurate than the models mentioned above within the given time frame.

Finally, in MLT7, the quality of all the produced models was tested based on the testing set obtained from the automatically labeled data, as well as on the set of 1000 pairs manually labeled by domain experts. The model quality was measured by means of precision, recall, measure F1, and accuracy.

==4. Development==

Pipeline implementation environment: Both the SM and ML pipelines were implemented in a typical data science environment, which included: (1) a relational database management system to store customer data, (2) CSV files to store temporary data produced by the tasks in the pipelines, (3) spreadsheet files to store groups of duplicated records, and (4) Python programs and standard packages to implement tasks in both pipelines.

Data size: In the reported project, in the development and testing phases we used 2.5 million pairs of customer records. In the production system, the pipelines run on the Bank customer database, which includes over 20 million records.

==5. Results==

Tab. 3 presents precision, recall, measure F1, and accuracy achieved by the ML models on the dataset of 1000 pairs manually labeled by the domain experts. We also add the results achieved by the statistical modeling approach for comparison. The numbers in brackets represent the ranking of the values (best - 1, worst - 5). As can be observed, the best performance on this dataset was obtained by the Statistical Model. The worst result was obtained by the decision tree. In the rest of the cases the ranking is not so obvious, as not all measures point to the same rank. However, since F1 is an aggregation of precision and recall, and the ranking of F1 is consistent with the accuracy, we can assume that the second best is random forest, followed by SVM and the feed-forward neural network (FFNN).
Table 3: Comparison of models on the expert labeled dataset
            Decision Tree   Random Forest   FFNN         SVM          Statistical model
Precision   [5] 0.7620      [3] 0.8369      [4] 0.8176   [2] 0.8422   [1] 0.8607
Recall      [5] 0.7801      [2] 0.8498      [3] 0.8391   [4] 0.8208   [1] 0.8799
F1          [5] 0.7631      [2] 0.8407      [4] 0.8241   [3] 0.8296   [1] 0.8673
Accuracy    [5] 0.7612      [2] 0.8383      [4] 0.8201   [3] 0.8276   [1] 0.8640

Table 4: Comparison of models on the testing dataset
            Decision Tree   Random Forest   FFNN         SVM          Statistical model
Precision   [2] 0.9577      [1] 0.9629      [3] 0.9444   [4] 0.8898   [5] 0.8051
Recall      [3] 0.9399      [2] 0.9551      [4] 0.9368   [1] 0.9619   [5] 0.8182
F1          [2] 0.9484      [1] 0.9590      [3] 0.9405   [4] 0.9201   [5] 0.8065
Accuracy    [2] 0.9958      [1] 0.9967      [3] 0.9953   [4] 0.9926   [5] 0.9859
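For reference, the metrics reported in Tab. 3 and Tab. 4 can be computed from true and predicted labels as sketched below. The paper does not state which multi-class averaging was used, so macro-averaging is assumed here; note also that another common convention averages per-class F1 values instead of taking the harmonic mean of macro precision and recall:

```python
def scores(true, pred, classes=("T", "P", "N")):
    """Macro-averaged precision, recall, F1, and accuracy for 3-class labels."""
    precisions, recalls = [], []
    for c in classes:
        tp = sum(1 for t, p in zip(true, pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(true, pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(true, pred) if t == c and p != c)
        precisions.append(tp / (tp + fp) if tp + fp else 0.0)
        recalls.append(tp / (tp + fn) if tp + fn else 0.0)
    precision = sum(precisions) / len(classes)
    recall = sum(recalls) / len(classes)
    # One macro-F1 convention: harmonic mean of macro precision and recall.
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = sum(1 for t, p in zip(true, pred) if t == p) / len(true)
    return precision, recall, f1, accuracy
```

In practice the project used sklearn, whose `sklearn.metrics` module provides equivalent functions with an explicit `average` parameter.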
The best results obtained via the Statistical Model stem most probably from the fact that all parameters of the model (mainly attribute weights and similarity thresholds) were optimized directly based on the expert labeled dataset also used for testing. The same dataset was only used for optimizing hyperparameters of the other models (random forest, SVM, FFNN). Decision trees were adjusted manually.

Tab. 4 presents precision, recall, F1, and accuracy achieved by the ML models on the testing dataset, obtained by means of the 89 rules. It is crucial to underline that the results presented in the table compare how close the results of the ML models are to the results obtained by applying the rules, and not to the ground truth (which is not available). The structure of the table is the same as in Tab. 3.

There are a few interesting observations to be made here. First, notice that the orders of precision, F1, and accuracy are consistent, while recall is different. However, similarly as before, we can ignore precision and recall in favor of their aggregation, i.e., F1, to obtain the ranking of the models. In this approach, the best model is random forest, followed by decision tree, FFNN, SVM, and the Statistical Model. Thus, the obtained order is completely different. We explain the results as follows.

Since the training and testing datasets had labels generated by the same set of rules, the classifiers that were trained on the training dataset tried to generalize the set of these rules. Thus, their performance on the testing dataset is quite good. The Statistical Model was not built based on the set of rules at all, and thus its performance on the testing dataset is worse. It is also interesting that tree-based classifiers, i.e., decision tree and random forest, perform better than mathematical models such as FFNN and SVM, which come down to a division of the attribute space by hyperplanes. We suspect that this stems from the fact that trees are just representations of sets of rules. Thus, the tree model structure fits the underlying source of the labels better than the mathematical models. One can also observe that a single test in a node of a tree is also a hyperplane dividing the attribute space, but such planes are orthogonal to one of the axes, while in the mathematical models, the hyperplanes are aligned arbitrarily. The larger degree of freedom makes it harder for the mathematical models to find the original model that produced the labels, which leads to worse results.

Regardless of the testing dataset, the best results were achieved by the random forest model. Its confusion matrix for the dataset labeled by the domain experts is given in Tab. 5. One of the most important observations here is the fact that the number of erroneous classifications involving class P is an order of magnitude greater than the number of erroneous classifications involving classes T and N. This observation confirms the result mentioned in Section 3.

Table 5: Confusion matrix for random forest (expert decision in rows, random forest decision in columns)
            N     P     T
Expert N   318   24     6
Expert P    24   319   68
Expert T     1   28    212
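The order-of-magnitude observation can be checked directly against the off-diagonal counts of Tab. 5:

```python
# Off-diagonal (erroneous) counts from Tab. 5, keyed by (expert, model) class.
errors = {
    ("N", "P"): 24, ("N", "T"): 6,
    ("P", "N"): 24, ("P", "T"): 68,
    ("T", "N"): 1,  ("T", "P"): 28,
}
errors_with_p = sum(v for (t, m), v in errors.items() if "P" in (t, m))
errors_without_p = sum(v for (t, m), v in errors.items() if "P" not in (t, m))
# 144 misclassifications involve class P, versus only 7 between T and N.
```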
==6. Summary==

In this paper, we reported our experience from an R&D project on deduplicating customer data. The project was realized for the biggest Polish Bank, PKO BP. While the presented solutions were designed explicitly for the Bank dataset, the findings are general and the approach can be fitted to other, similar problems.

In the project, we designed two deduplication pipelines: (1) based on statistical modeling and (2) based on machine learning. The SM pipeline extends the standard deduplication pipeline from the literature to the particular characteristics of data being deduplicated and to the project requirements. The ML pipeline was designed from scratch within the project. Both pipelines were implemented, tested on a data set of 2.5 million pairs of customer records, and verified by domain experts. The obtained results were accepted by the Bank. The project ended with the decision to run the SM pipeline in the production system on the Bank database, which includes over 20 million customer records. This pipeline has already been deployed in the Bank.

It must be stressed that the real settings of the project differ from the ones assumed in the research literature (the ideal ones) among others w.r.t.: (1) the quality of data being deduplicated, (2) the sizes of deduplicated data, (3) the availability of tagged data for ML algorithms, (4) available development environments, and (5) the performance of a deduplication pipeline (time and quality). The real settings, which are far from the ideal ones, made the project very challenging.

Acknowledgements. The project is supported by a grant from the National Center for Research and Development no. POIR.01.01.01-00-0287/19.

==References==

[1] A. K. Elmagarmid, P. G. Ipeirotis, V. S. Verykios, Duplicate record detection: A survey, IEEE Transactions on Knowledge and Data Engineering 19 (2007) 1–16.
[2] H. Köpcke, E. Rahm, Frameworks for entity matching: A comparison, Data & Knowledge Engineering 69 (2010) 197–210.
[3] G. Papadakis, L. Tsekouras, E. Thanos, G. Giannakopoulos, T. Palpanas, M. Koubarakis, Domain- and structure-agnostic end-to-end entity resolution with JedAI, SIGMOD Record 48 (2019) 30–36.
[4] X. Chen, Y. Xu, D. Broneske, G. C. Durand, R. Zoun, G. Saake, Heterogeneous committee-based active learning for entity resolution (HeALER), in: European Conf. on Advances in Databases and Information Systems (ADBIS), volume 11695 of LNCS, Springer, 2019, pp. 69–85.
[5] S. Sarawagi, A. Bhamidipaty, Interactive deduplication using active learning, in: ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining, ACM, 2002, pp. 269–278.
[6] J. A. Silva, D. A. Pereira, A multiclass classification approach for incremental entity resolution on short textual data, Int. Journal of Business Intelligence and Data Mining 18 (2021) 218–245.
[7] U. Brunner, K. Stockinger, Entity matching with transformer architectures - a step forward in data integration, in: Int. Conf. on Extending Database Technology (EDBT), OpenProceedings.org, 2020, pp. 463–473.
[8] A. Jain, S. Sarawagi, P. Sen, Deep indexed active learning for matching heterogeneous entity representations, Proc. VLDB Endowment 15 (2021) 31–45.
[9] S. Mudgal, H. Li, T. Rekatsinas, A. Doan, Y. Park, G. Krishnan, R. Deep, E. Arcaute, V. Raghavendra, Deep learning for entity matching: A design space exploration, in: SIGMOD Int. Conf. on Management of Data, ACM, 2018, pp. 19–34.
[10] S. Thirumuruganathan, H. Li, N. Tang, M. Ouzzani, Y. Govind, D. Paulsen, G. Fung, A. Doan, Deep learning for blocking in entity matching: A design space exploration, Proc. VLDB Endowment 14 (2021) 2459–2472.
[11] A. Zeakis, G. Papadakis, D. Skoutas, M. Koubarakis, Pre-trained embeddings for entity resolution: An experimental analysis, Proc. VLDB Endowment 16 (2023) 2225–2238.
[12] P. Boiński, M. Sienkiewicz, B. Bębel, R. Wrembel, D. Gałęzowski, W. Graniszewski, On customer data deduplication: Lessons learned from a R&D project in the financial sector, in: Workshops of the EDBT/ICDT 2022 Joint Conference, volume 3135 of CEUR Workshop Proceedings, CEUR-WS.org, 2022.
[13] P. Boiński, W. Andrzejewski, B. Bębel, R. Wrembel, On tuning the sorted neighborhood method for record comparisons in a data deduplication pipeline - industrial experience report, in: Int. Conf. on Database and Expert Systems Applications (DEXA), volume 14146 of LNCS, Springer, 2023, pp. 164–178.
[14] P. Christen, Data Matching - Concepts and Techniques for Record Linkage, Entity Resolution, and Duplicate Detection, Data-Centric Systems and Applications, Springer, 2012.
[15] F. Naumann, Similarity measures, Hasso Plattner Institut, 2013.
[16] W. Andrzejewski, B. Bębel, P. Boiński, M. Sienkiewicz, R. Wrembel, Text similarity measures in a data deduplication pipeline for customers records, in: Int. Workshop on Design, Optimization, Languages and Analytical Processing of Big Data (DOLAP), volume 3369 of CEUR Workshop Proceedings, CEUR-WS.org, 2023, pp. 33–42.
[17] W. Andrzejewski, B. Bębel, P. Boiński, R. Wrembel, On tuning parameters guiding similarity computations in a data deduplication pipeline for customers records: Experience from a R&D project, Information Systems 121 (2024) 102323.
[18] F. Hüffner, C. Komusiewicz, A. Liebtrau, R. Niedermeier, Partitioning biological networks into highly connected clusters with maximum edge coverage, IEEE/ACM Transactions on Computational Biology and Bioinformatics 11 (2014) 455–467.